Cannot recover index because of missing translog files

Good day.

I have ES 2.1.0 with Kibana in a single-node demo environment. After an unclean shutdown, the .kibana index refuses to come up because of the following error in the logs:

[2016-01-04 18:52:06,313][WARN ][cluster.action.shard     ] [One Above All] [.kibana][0] received shard failed for [.kibana][0], node[gF-MmFXhRD6RpzhoX4j0fw], [P], v[480], s[INITIALIZING], a[id=WOf4yH4-SRql1VcU667asQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-01-04T16:52:06.306Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog]; ]], indexUUID [fUvsCJ_XTMy29AhHwArEww], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog]; ]
[.kibana][[.kibana][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog];
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:254)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [.kibana][[.kibana][0]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog];
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:156)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        ... 5 more
Caused by: java.nio.file.NoSuchFileException: /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)

...
(I've cut the last lines of the exception because of the post size limit. The full exception is here: http://pastebin.com/raw/zWAjwxx3)

This message keeps filling the log file and eating up gigabytes of disk space (even after upgrading to ES 2.1.1).

The file indeed does not exist. Here are the contents of /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/:

$ ls -lah /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog
total 112K
drwxr-xr-x 2 elasticsearch elasticsearch 4.0K Jan  4 18:56 .
drwxr-xr-x 5 elasticsearch elasticsearch 4.0K Jan  4 17:25 ..
-rw-r--r-- 1 elasticsearch elasticsearch   20 Jan  3 19:18 translog-43.ckp
-rw-r--r-- 1 elasticsearch elasticsearch  96K Jan  3 19:12 translog-43.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   20 Jan  3 19:19 translog.ckp
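
For anyone curious about what ES expects to find in those 20-byte .ckp files: the layout below is my assumption (big-endian int64 offset, int32 numOps, int64 generation, as Lucene-style data files are written), not something I've seen documented. A rough sketch that decodes a synthetic stand-in checkpoint (offset=100, numOps=0, generation=43) with GNU od:

```shell
# Build a synthetic 20-byte .ckp stand-in (real files live in the translog dir).
CKP=$(mktemp)
printf '\x00\x00\x00\x00\x00\x00\x00\x64\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x2b' > "$CKP"

# Assumed field layout: bytes 0-7 offset, 8-11 numOps, 12-19 generation (big-endian).
offset=$(od -A n -j 0 -N 8 -t d8 --endian=big "$CKP" | tr -d ' ')
num_ops=$(od -A n -j 8 -N 4 -t d4 --endian=big "$CKP" | tr -d ' ')
generation=$(od -A n -j 12 -N 8 -t d8 --endian=big "$CKP" | tr -d ' ')
echo "offset=$offset numOps=$num_ops generation=$generation"
```

If the layout guess is right, a checkpoint whose generation points at a .tlog file that no longer exists is exactly the state described in this thread.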

What I have tried so far:

  • Upgrade to 2.1.1
  • Remove the translog files, fully or partially (ES was stopped while I fiddled with the files)

Nothing helped.

How can I force ES to start with whatever is in the index and begin the transaction log from scratch?

The index data itself seems to be fine. Please don't tell me that one unclean shutdown can kill the whole index.

I've checked the index itself using the following: http://pastebin.com/raw/LnBDtJXP

Thank you in advance!

That's very peculiar and shouldn't happen (famous last words). Any chance you can share the logs of that node for the period of the unclean shutdown? We might see something there.

I had to remove them all because this message bloats the logs quite rapidly; they grew to over 20GB in just a few hours.
Had I been thinking, I would have saved them aside. Hopefully I'll remember to do so next time.

But the question remains: cause aside, is there a way to recover? Again, the index data seems to be there...

P.S. The fact that in this scenario ES bloats the logs like crazy seems like another bug to me.

Another idea: the index is configured to have one replica. If I fire up another node (it's a single-node cluster currently) and join it to the cluster, is there a way to assign the replica there and promote it to primary?

Would you be able to send me the translog.ckp and translog-43.ckp files?

I also wonder what you mean by "after unclean shutdown"; can you elaborate?

When did you get this exception for the first time? Was it after restarting?

I've sent you a link to the translog files in a private message.

By unclean shutdown I mean:

I hit the power button on my desktop, which triggered an ACPI shutdown, and waited until the computer shut off. I relocated it to another room, pressed the power button, and went home. Several hours later I tried to connect to Kibana and discovered that its index is unavailable and that ES had produced over 20GB of logs. I checked the last log file, saw the aforementioned exception, and started to investigate.

The kibana index was fully functional before the shutdown.

Thank you very much again for looking into this.

OK, I think I've found an issue that can cause this problem. I will update you on it in a bit.

@haizzar thanks for bearing with me and sending me all the info. I found the problem and we will have a fix for it committed shortly here: https://github.com/elastic/elasticsearch/pull/15788

I checked your ckp files, and the one that refers to the missing translog-44.tlog file indeed has 0 operations recorded. Since this is transactional, what you can do is rename translog-43.ckp to translog.ckp and start up the node. This should recover it for you just fine. Please let me know if that worked for you.
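
For anyone landing here in the same state, a minimal sketch of that fix, simulated in a temp directory standing in for /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog (on a real node: stop Elasticsearch first, and keep a copy of the original translog.ckp before touching anything; the file contents below are placeholders):

```shell
# Simulated translog directory; only touch the real one with Elasticsearch stopped.
TLOG=$(mktemp -d)
printf 'gen-44-checkpoint' > "$TLOG/translog.ckp"     # points at the missing translog-44.tlog
printf 'gen-43-checkpoint' > "$TLOG/translog-43.ckp"  # last checkpoint whose .tlog still exists
printf 'gen-43-operations' > "$TLOG/translog-43.tlog"

cp "$TLOG/translog.ckp" "$TLOG/translog.ckp.bak"      # safety copy of the broken checkpoint
cp "$TLOG/translog-43.ckp" "$TLOG/translog.ckp"       # roll back to the last good checkpoint
cat "$TLOG/translog.ckp"
```

This works here because the missing-generation checkpoint recorded 0 operations, so rolling back to generation 43 loses nothing; if the broken checkpoint had recorded operations, those would be discarded.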


@s1monw thank you so much for the solution! It worked perfectly. The index recovered and is working OK. I'll make sure to remember this for the next time it (should not :slightly_smiling:) happens.

P.S. Apologies for the late response - I was on vacation.

I've had a corrupted translog a couple of times now after power failures. There seems to be no way of recovering from it; ES (2.3, 2.3.1, 2.3.2) just keeps trying, and eventually I give up and have to completely delete the index. Not a nice situation. I mean, it could be worse (if ES insisted on the index being there, which it fortunately doesn't), but really it should start up without a translog instead of going into an endless loop.

Stefan


Worked for me. Thanks a bunch :slight_smile: