Translog is corrupted

Hello,

I have a cluster of 3 ES nodes running version 7.10.1. One node had an HDD failure last night. We brought it back up, and almost all indices have recovered successfully.

However, there are 4 indices that fail to recover, with messages like:

[2021-09-27T11:24:40,917][WARN ][o.e.i.c.IndicesClusterStateService] [elk2] [logstash-xxx][0] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-xxx][0]: Recovery failed on {elk2}{2G3ZbsFCRO-IQwXRtmZcZA}{JmIIMdRARYKFFDpfTwjICA}{elk2}{xx.xx.xx.xx:9300}{cdhimrstw}{xpack.installed=true, transform.node=true}
        at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2676) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:355) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:328) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:96) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1894) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.1.jar:7.10.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:441) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.10.1.jar:7.10.1]
        ... 8 more
Caused by: org.elasticsearch.index.engine.EngineException: failed to recover from translog
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:501) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:474) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:125) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1621) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:436) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.10.1.jar:7.10.1]
        ... 8 more
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/data/es_data/nodes/0/indices/Q6LK3rm0QWmbCe_9iuQ29Q/0/translog/translog-23408.tlog] is corrupted, translog truncated
        at org.elasticsearch.index.translog.TranslogSnapshot.readBytes(TranslogSnapshot.java:107) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.BaseTranslogReader.readSize(BaseTranslogReader.java:79) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:80) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.Translog$SeqNoFilterSnapshot.next(Translog.java:972) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1565) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$9(IndexShard.java:1616) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:499) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:474) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:125) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1621) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:436) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.10.1.jar:7.10.1]
        ... 8 more
Caused by: java.io.EOFException: read past EOF. pos [7419] length: [4] end: [7419]
        at org.elasticsearch.common.io.Channels.readFromFileChannelWithEofException(Channels.java:103) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.readBytes(TranslogSnapshot.java:105) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.BaseTranslogReader.readSize(BaseTranslogReader.java:79) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:80) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.translog.Translog$SeqNoFilterSnapshot.next(Translog.java:972) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1565) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$9(IndexShard.java:1616) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:499) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:474) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:125) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1621) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:436) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98) ~[elasticsearch-7.10.1.jar:7.10.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.10.1.jar:7.10.1]
        ... 8 more
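
The affected shard can be read straight from the translog path in the TranslogCorruptedException line. A minimal Python sketch of that, assuming the `nodes/<n>/indices/<index-uuid>/<shard>/translog/` layout seen in the path above; the function name and the regular expression are illustrative and not part of Elasticsearch:

```python
import re

# Pattern for the data-path layout seen in the exception message:
# .../indices/<index-uuid>/<shard-id>/translog/translog-<generation>.tlog
TRANSLOG_PATH = re.compile(
    r"indices/(?P<uuid>[^/]+)/(?P<shard>\d+)/translog/translog-(?P<gen>\d+)\.tlog"
)


def parse_translog_path(message):
    """Extract index UUID, shard id and translog generation from a log line.

    Returns None when the line does not contain a translog file path.
    """
    match = TRANSLOG_PATH.search(message)
    if match is None:
        return None
    return {
        "index_uuid": match.group("uuid"),
        "shard_id": int(match.group("shard")),
        "generation": int(match.group("gen")),
    }


line = (
    "translog from source [/data/es_data/nodes/0/indices/"
    "Q6LK3rm0QWmbCe_9iuQ29Q/0/translog/translog-23408.tlog] "
    "is corrupted, translog truncated"
)
info = parse_translog_path(line)
print(info)
```

The UUID can then be matched against the `uuid` column of `GET _cat/indices?v` to confirm which index name it belongs to.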

In the thread "Corrupted translog" you recommended the command-line tool elasticsearch-translog, but I can't find it in version 7.10.1. Was it removed, or where can I find it?

If the tool is unavailable, is there a way to remove only the corrupted documents instead of deleting the whole index? Would deleting only the corrupted translog-*.tlog files be enough?

Regards,
Ivan

Any ideas?

It seems that elasticsearch-translog has been renamed to elasticsearch-shard; see the elasticsearch-shard page in the Elasticsearch Guide.

Clearing translog files through elasticsearch-shard fixed my indices.
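
For anyone finding this later, here is a rough Python sketch of how the invocation could be assembled for each affected shard. It only builds and prints the command, it does not run it. The `remove-corrupted-data` subcommand and the `--index` / `--shard-id` options are taken from the elasticsearch-shard documentation; the install path `/usr/share/elasticsearch` and the helper name are assumptions for illustration.

```python
import shlex


def build_shard_repair_command(index, shard_id, es_home="/usr/share/elasticsearch"):
    """Return the argv list for `elasticsearch-shard remove-corrupted-data`.

    es_home is an assumed install location; adjust it to your setup.
    """
    return [
        f"{es_home}/bin/elasticsearch-shard",
        "remove-corrupted-data",
        "--index", index,
        "--shard-id", str(shard_id),
    ]


# Index name and shard number as they appear in the log above.
cmd = build_shard_repair_command("logstash-xxx", 0)
print(shlex.join(cmd))
```

As far as I understand the docs, Elasticsearch has to be stopped on that node before the tool is run, the tool discards the unreadable translog operations (so some recent writes may be lost), and afterwards it prints a cluster reroute command that still has to be executed once the node is back up.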
