Elasticsearch Fills Logs with Error Messages When Shard Fails to Recover


#1

Elasticsearch version: 2.3.1

I am seeing an issue where Elasticsearch is filling up disk space (over 40 GB) with error logging. It's the same exception over and over again:

[2016-08-16 15:59:57,827][WARN ][cluster.action.shard     ] [node] [geocortex.core.roles.elasticsearch.watcher][0] received shard failed for target shard [[geocortex.core.roles.elasticsearch.watcher][0], node[btC68MOzSHKQPa-ANpR], [P], v[35], s[INITIALIZING], a[id=8qt4DOJkQVahfclZFGjqmg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-16T15:59:50.811Z]]], indexUUID [K5GAD2zsSbaQ9GhWnImH2Q], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
[geocortex.core.roles.elasticsearch.watcher][[geocortex.core.roles.elasticsearch.watcher][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [geocortex.core.roles.elasticsearch.watcher][[geocortex.core.roles.elasticsearch.watcher][0]] EngineCreationFailureException[failed to create engine]; nested: EOFException;
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:155)
    at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
    at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1515)
    at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1499)
    at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:972)
    at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:944)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
    ... 5 more
Caused by: java.io.EOFException
    at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
    at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
    at org.elasticsearch.index.translog.Checkpoint.<init>(Checkpoint.java:54)
    at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:83)
    at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:337)
    at org.elasticsearch.index.translog.Translog.<init>(Translog.java:179)
    at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:151)
    ... 11 more

The error happens as soon as Elasticsearch starts and tries to recover the indices. It looks to me like an error while reading the translog file. Unfortunately, I no longer have the translog file. Does anyone have any idea why this is happening?
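For context, the bottom of the stack trace points at `Checkpoint.read`, which reads a small fixed-size record from the translog checkpoint (`.ckp`) file. The sketch below mimics that read to show how a truncated checkpoint produces an end-of-file error on every recovery attempt. The 20-byte big-endian layout (offset as int64, numOps as int32, generation as int64) is an assumption inferred from the 2.x stack frames, not taken from official documentation:

```python
import struct

def read_checkpoint(path):
    """Parse a translog .ckp the way ES 2.x appears to: offset (int64),
    numOps (int32), generation (int64) -- 20 bytes, big-endian.
    (Layout assumed from the stack trace; illustrative only.)"""
    with open(path, "rb") as f:
        data = f.read(20)
    if len(data) < 20:
        # Lucene's DataInput raises EOFException here; Python's analogue:
        raise EOFError("checkpoint truncated: %d of 20 bytes" % len(data))
    return struct.unpack(">qiq", data)  # (offset, num_ops, generation)

# An intact checkpoint parses fine:
with open("/tmp/translog.ckp", "wb") as f:
    f.write(struct.pack(">qiq", 1024, 35, 7))
print(read_checkpoint("/tmp/translog.ckp"))   # (1024, 35, 7)

# Truncating it raises EOF on every read, i.e. on every recovery attempt:
with open("/tmp/translog.ckp", "r+b") as f:
    f.truncate(10)
try:
    read_checkpoint("/tmp/translog.ckp")
except EOFError as e:
    print("recovery would fail:", e)
```

Since the node retries recovery and hits the same truncated file each time, the same exception is logged on every attempt, which matches the log growth described above.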


(Adrien Grand) #2

It looks like your translog file has been truncated somehow, which prevents Elasticsearch from performing recovery. If you don't mind losing some data, you could remove the translog files from disk and restart Elasticsearch.
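Concretely, in 2.x the per-shard translog sits under `<path.data>/<cluster>/nodes/<n>/indices/<index>/<shard>/translog`. The sketch below demonstrates the cleanup step against a stand-in directory tree built in a temp dir, so it is safe to run as-is; the cluster and path names are examples only. On a real node you would stop the node first and accept that any operations not yet flushed to the Lucene index are lost:

```python
import pathlib, shutil, tempfile

# Build a stand-in for the ES 2.x on-disk layout (illustrative paths):
data = pathlib.Path(tempfile.mkdtemp())  # stand-in for path.data
shard = (data / "mycluster" / "nodes" / "0" / "indices"
              / "geocortex.core.roles.elasticsearch.watcher" / "0")
(shard / "translog").mkdir(parents=True)
(shard / "index").mkdir()
(shard / "translog" / "translog-35.tlog").touch()
(shard / "translog" / "translog.ckp").touch()

# With the node stopped, removing the translog directory discards any
# unflushed operations but lets the shard recover from the Lucene index:
shutil.rmtree(shard / "translog")
print(sorted(p.name for p in shard.iterdir()))  # ['index']
```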


#3

I can reproduce the issue by manually truncating one of the translog .ckp files. Is there a way to prevent Elasticsearch from filling the log files with gigabytes of data? Possibly by stopping recovery attempts after the shard fails x number of times?

