ES failed to recover from translog corruption


#1

Hi,

My single-node ES server was forcibly shut down, and it hasn't recovered since. ES is now stuck in this error state. How do I recover from this and start ES? Thanks.

I am running Version: 6.3.1, Build: default/deb/eb782d0/2018-06-29T21:59:26.107521Z, JVM: 1.8.0_171

The following are the relevant logs for ES on restarting.

[2018-11-18T10:06:50,537][INFO ][o.e.n.Node               ] [p0HJEer] initialized
[2018-11-18T10:06:50,539][INFO ][o.e.n.Node               ] [p0HJEer] starting ...
[2018-11-18T10:06:50,875][INFO ][o.e.t.TransportService   ] [p0HJEer] publish_address {10.0.3.13:9300}, bound_addresses {[::]:9300}
[2018-11-18T10:06:51,120][INFO ][o.e.b.BootstrapChecks    ] [p0HJEer] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-11-18T10:06:54,244][INFO ][o.e.c.s.MasterService    ] [p0HJEer] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2018-11-18T10:06:54,258][INFO ][o.e.c.s.ClusterApplierService] [p0HJEer] new_master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {p0HJEer}{p0HJEerWRvC0mTRHWfA7-A}{k0C7PgTPR0qtFWkPiI-nhw}{10.0.3.13}{10.0.3.13:9300}{ml.machine_memory=4124848128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)[, ]]])
[2018-11-18T10:06:54,314][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [p0HJEer] publish_address {10.0.3.13:9200}, bound_addresses {[::]:9200}
[2018-11-18T10:06:54,315][INFO ][o.e.n.Node               ] [p0HJEer] started
[2018-11-18T10:06:58,964][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [p0HJEer] Failed to clear cache for realms [[]]
[2018-11-18T10:06:59,116][INFO ][o.e.l.LicenseService     ] [p0HJEer] license [93cf2962-5aff-40a7-a43c-325357b2bf79] mode [basic] - valid
[2018-11-18T10:06:59,177][INFO ][o.e.g.GatewayService     ] [p0HJEer] recovered [138] indices into cluster_state
[2018-11-18T10:07:02,728][WARN ][o.e.i.e.Engine           ] [p0HJEer] [logstash-2018.11.18][1] failed engine [failed to recover from translog]
org.elasticsearch.index.engine.EngineException: failed to recover from translog
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:406) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:377) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:98) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1297) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1567) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2020) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: operation size must be at least 4 but was: 0
        at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1460) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.BaseTranslogReader.read(BaseTranslogReader.java:110) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.readOperation(TranslogSnapshot.java:73) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.TranslogSnapshot.next(TranslogSnapshot.java:64) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:69) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1262) ~[elasticsearch-6.3.1.jar:6.3.1]
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:404) ~[elasticsearch-6.3.1.jar:6.3.1]
        ... 13 more

(David Turner) #2

I think you will need to run the translog truncation tool. This may lose some data, I'm afraid.

I also suspect that this indicates that your system is not properly honouring fsync() calls, which you might want to investigate.
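For reference, in 6.3.x the truncation tool ships as `bin/elasticsearch-translog` (later replaced by the `elasticsearch-shard` tool). A minimal sketch of the invocation, assuming the default Debian data path; the UUID and shard number below are illustrative, and the node must be stopped before the tool is run:

```shell
# Sketch only: build the translog path for the broken shard and print the
# truncation command. The paths and UUID here are illustrative assumptions.
NODE_DIR=/var/lib/elasticsearch/nodes/0   # default Debian data path
INDEX_UUID=8JNp_LXpRCab3VPlMoWvew         # index UUID (see GET /_cat/indices)
SHARD=1                                   # shard number from the log: [logstash-2018.11.18][1]
TRANSLOG_DIR="$NODE_DIR/indices/$INDEX_UUID/$SHARD/translog"

# Stop the node first, then run (as the elasticsearch user):
echo "bin/elasticsearch-translog truncate -d $TRANSLOG_DIR"
```

The `echo` is deliberate: it prints the command you would run rather than running it, so you can double-check the path before touching the data directory.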


#3

Thanks David. How do I find the directory containing the translog in question? There are 138 subdirectories in /var/lib/elasticsearch/nodes/0/indices/, with further subdirectories under them.


(David Turner) #4

First of all, I should have mentioned that if you can restore this shard from a recent snapshot then that would be preferable. Truncating the translog will lose some unknown amount of data, whereas at least with a recent snapshot you know how large a time window is missing.

That's a very good question. We've improved the error messages (and the tool) in 6.5 to answer this, but in 6.3.1 you need to work it out by hand. The log indicates the broken shard is [logstash-2018.11.18][1] so you need the index UUID for logstash-2018.11.18:

GET /_cat/indices/logstash-2018.11.18

For instance this might output:

green open logstash-2018.11.18 8JNp_LXpRCab3VPlMoWvew 2 0 0 0 230b 230b
                               ^^^^^^^^^^^^^^^^^^^^^^ this is the index UUID

Then (in 6.3.1 at least) the shard's translog directory is

/var/lib/elasticsearch/nodes/0/indices/8JNp_LXpRCab3VPlMoWvew/1/translog
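If you'd rather script the lookup, the UUID is the fourth column of the default `_cat/indices` output. A sketch, using a hard-coded sample line in place of the `curl -s localhost:9200/_cat/indices/logstash-2018.11.18` output you would pipe in on a live node:

```shell
# Pull the index UUID out of a _cat/indices line and build the translog path.
# The sample line stands in for real curl output; the path layout assumes a
# single-node install with data at /var/lib/elasticsearch.
LINE="green open logstash-2018.11.18 8JNp_LXpRCab3VPlMoWvew 2 0 0 0 230b 230b"
UUID=$(echo "$LINE" | awk '{print $4}')
echo "/var/lib/elasticsearch/nodes/0/indices/$UUID/1/translog"
```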

#5

Thank you. I was able to solve the problem. This is our QA environment; it's a Sunday and we don't snapshot QA, so I have no problem losing some records.

However, there were 5 translog generations inside the shard's translog directory. I was not able to work out which generation to choose, so I just truncated them all, and that apparently worked. For future reference, how would someone find the right generation number?

curl http://localhost:9200/_cat/indices/logstash-2018.11.18
yellow open logstash-2018.11.18 vnzRrSS-TAKGBfdEYv2T7A 5 1 49935 0 3.8mb 3.8mb

(David Turner) #6

I didn't know that choosing a generation was needed. In 6.5.0 it'd be in the error message, but I thought the only option was to truncate the whole translog.
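For completeness: each `translog-<N>.tlog` file in the shard's translog directory is one generation (alongside the `translog.ckp` checkpoint file), and the 6.3 tool truncates the whole directory rather than a single generation. A sketch of pulling the generation numbers out of such file names (the sample listing is illustrative):

```shell
# List translog generation numbers from file names.
# FILES stands in for the output of `ls` on a real translog directory.
FILES="translog-1.tlog translog-2.tlog translog-5.tlog translog.ckp"
for f in $FILES; do
  case "$f" in
    translog-*.tlog) n="${f#translog-}"; echo "${n%.tlog}" ;;
  esac
done
```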