ES failed to recover from translog corruption


#1

Hi,

My single-node ES server was forcibly shut down, and it hasn't recovered since. ES is now stuck in this error state. How do I recover from this and start ES? Thanks.

I am running Version: 6.3.1, Build: default/deb/eb782d0/2018-06-29T21:59:26.107521Z, JVM: 1.8.0_171

The following are the relevant logs for ES on restarting.

[2018-11-18T10:06:50,537][INFO ][o.e.n.Node               ] [p0HJEer] initialized
[2018-11-18T10:06:50,539][INFO ][o.e.n.Node               ] [p0HJEer] starting ...
[2018-11-18T10:06:50,875][INFO ][o.e.t.TransportService   ] [p0HJEer] publish_address {10.0.3.13:9300}, bound_addresses {[::]:9300}
[2018-11-18T10:06:51,120][INFO ][o.e.b.BootstrapChecks    ] [p0HJEer] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-11-18T10:06:54,244][INFO ][o.e.c.s.MasterService    ] [p0HJEer] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2018-11-18T10:06:54,258][INFO ][o.e.c.s.ClusterApplierService] [p0HJEer] new_master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)[, ]]])
[2018-11-18T10:06:54,314][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [p0HJEer] publish_address {10.0.3.13:9200}, bound_addresses {[::]:9200}
[2018-11-18T10:06:54,315][INFO ][o.e.n.Node               ] [p0HJEer] started
[2018-11-18T10:06:58,964][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [p0HJEer] Failed to clear cache for realms [[]]
[2018-11-18T10:06:59,116][INFO ][o.e.l.LicenseService     ] [p0HJEer] license [93cf2962-5aff-40a7-a43c-325357b2bf79] mode [basic] - valid
[2018-11-18T10:06:59,177][INFO ][o.e.g.GatewayService     ] [p0HJEer] recovered [138] indices into cluster_state
[2018-11-18T10:07:02,728][WARN ][o.e.i.e.Engine           ] [p0HJEer] [logstash-2018.11.18][1] failed engine [failed to recover from translog]
org.elasticsearch.index.engine.EngineException: failed to recover from translog
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:406) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:377) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:98) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1297) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1567) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2020) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: operation size must be at least 4 but was: 0
        at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1460) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.BaseTranslogReader.read(BaseTranslogReader.java:110) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:73) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:64) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:69) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1262) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:404) ~[elasticsearch-6.3.1.jar:6.3.1]
        ... 13 more

(David Turner) #2

I think you will need to run the translog truncation tool. This may lose some data, I'm afraid.

I also suspect that this indicates that your system is not properly honouring fsync() calls, which you might want to investigate.
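For reference, in 6.3.x the truncation tool ships as `bin/elasticsearch-translog` (later replaced by the `elasticsearch-shard` tool). A minimal sketch of the invocation, assuming the default Debian data path; the UUID and shard number below are illustrative, and the node must be stopped before the tool is run:

```shell
# Sketch only: build the translog path for the broken shard and print the
# truncation command. The paths and UUID here are illustrative assumptions.
NODE_DIR=/var/lib/elasticsearch/nodes/0   # default Debian data path
INDEX_UUID=8JNp_LXpRCab3VPlMoWvew         # index UUID (see GET /_cat/indices)
SHARD=1                                   # shard number from the log: [logstash-2018.11.18][1]
TRANSLOG_DIR="$NODE_DIR/indices/$INDEX_UUID/$SHARD/translog"

# Stop the node first, then run (as the elasticsearch user):
echo "bin/elasticsearch-translog truncate -d $TRANSLOG_DIR"
```

The `echo` is deliberate: it prints the command you would run rather than running it, so you can double-check the path before touching the data directory.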


#3

Thanks David. How do I find the directory containing the translog in question? There are 138 subdirectories in /var/lib/elasticsearch/nodes/0/indices/, with further subdirectories under them.


(David Turner) #4

First of all, I should have mentioned that if you can restore this shard from a recent snapshot then that would be preferable. Truncating the translog will lose some unknown amount of data, whereas at least with a recent snapshot you know how large a time window is missing.

That's a very good question. We've improved the error messages (and the tool) in 6.5 to answer this, but in 6.3.1 you need to work it out by hand. The log indicates the broken shard is [logstash-2018.11.18][1] so you need the index UUID for logstash-2018.11.18:

GET /_cat/indices/logstash-2018.11.18

For instance this might output:

green open logstash-2018.11.18 8JNp_LXpRCab3VPlMoWvew 2 0 0 0 230b 230b
                               ^^^^^^^^^^^^^^^^^^^^^^ this is the index UUID

Then (in 6.3.1 at least) the shard's translog directory is

/var/lib/elasticsearch/nodes/0/indices/8JNp_LXpRCab3VPlMoWvew/1/translog
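If you'd rather script the lookup, the UUID is the fourth column of the default `_cat/indices` output. A sketch, using a hard-coded sample line in place of the `curl -s localhost:9200/_cat/indices/logstash-2018.11.18` output you would pipe in on a live node:

```shell
# Pull the index UUID out of a _cat/indices line and build the translog path.
# The sample line stands in for real curl output; the path layout assumes a
# single-node install with data at /var/lib/elasticsearch.
LINE="green open logstash-2018.11.18 8JNp_LXpRCab3VPlMoWvew 2 0 0 0 230b 230b"
UUID=$(echo "$LINE" | awk '{print $4}')
echo "/var/lib/elasticsearch/nodes/0/indices/$UUID/1/translog"
```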

#5

Thank you. I was able to solve the problem. This is our QA environment; it's a Sunday and we don't snapshot QA, so I have no problem losing some records.

However, there were 5 translog generations inside the shard's translog directory. I was not able to work out which generation to choose, so I just truncated them all, and that apparently worked. For future reference, how would someone find the right generation number?

curl http://localhost:9200/_cat/indices/logstash-2018.11.18
yellow open logstash-2018.11.18 vnzRrSS-TAKGBfdEYv2T7A 5 1 49935 0 3.8mb 3.8mb

(David Turner) #6

I didn't know that choosing a generation was needed. In 6.5.0 it'd be in the error message, but I thought the only option was to truncate the whole translog.
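For completeness: each `translog-<N>.tlog` file in the shard's translog directory is one generation (alongside the `translog.ckp` checkpoint file), and the 6.3 tool truncates the whole directory rather than a single generation. A sketch of pulling the generation numbers out of such file names (the sample listing is illustrative):

```shell
# List translog generation numbers from file names.
# FILES stands in for the output of `ls` on a real translog directory.
FILES="translog-1.tlog translog-2.tlog translog-5.tlog translog.ckp"
for f in $FILES; do
  case "$f" in
    translog-*.tlog) n="${f#translog-}"; echo "${n%.tlog}" ;;
  esac
done
```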