Cannot recover index because of missing translog files

Good day.

I have ES 2.1.0 with Kibana in a single-node demo environment. After an unclean shutdown, the .kibana index refuses to come up because of the following error in the logs:

[2016-01-04 18:52:06,313][WARN ][cluster.action.shard     ] [One Above All] [.kibana][0] received shard failed for [.kibana][0], node[gF-MmFXhRD6RpzhoX4j0fw], [P], v[480], s[INITIALIZING], a[id=WOf4yH4-SRql1VcU667asQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-01-04T16:52:06.306Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog]; ]], indexUUID [fUvsCJ_XTMy29AhHwArEww], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog]; ]
[.kibana][[.kibana][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog];
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:254)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [.kibana][[.kibana][0]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog];
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:156)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
        ... 5 more
Caused by: java.nio.file.NoSuchFileException: /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/translog-44.tlog
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)

...
(I've cut the last lines of the exception because of the post size limit. The full exception is here: http://pastebin.com/raw/zWAjwxx3)

This message keeps filling the log file and eating up gigabytes of disk space (even after upgrading to ES 2.1.1).

The file indeed does not exist. Here are the contents of /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog/:

$ ls -lah /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog
total 112K
drwxr-xr-x 2 elasticsearch elasticsearch 4.0K Jan  4 18:56 .
drwxr-xr-x 5 elasticsearch elasticsearch 4.0K Jan  4 17:25 ..
-rw-r--r-- 1 elasticsearch elasticsearch   20 Jan  3 19:18 translog-43.ckp
-rw-r--r-- 1 elasticsearch elasticsearch  96K Jan  3 19:12 translog-43.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   20 Jan  3 19:19 translog.ckp
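
For anyone curious about what ES expects to find in those 20-byte .ckp files: the layout below is my assumption (big-endian int64 offset, int32 numOps, int64 generation, as Lucene-style data files are written), not something I've seen documented. A rough sketch that decodes a synthetic stand-in checkpoint (offset=100, numOps=0, generation=43) with GNU od:

```shell
# Build a synthetic 20-byte .ckp stand-in (real files live in the translog dir).
CKP=$(mktemp)
printf '\x00\x00\x00\x00\x00\x00\x00\x64\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x2b' > "$CKP"

# Assumed field layout: bytes 0-7 offset, 8-11 numOps, 12-19 generation (big-endian).
offset=$(od -A n -j 0 -N 8 -t d8 --endian=big "$CKP" | tr -d ' ')
num_ops=$(od -A n -j 8 -N 4 -t d4 --endian=big "$CKP" | tr -d ' ')
generation=$(od -A n -j 12 -N 8 -t d8 --endian=big "$CKP" | tr -d ' ')
echo "offset=$offset numOps=$num_ops generation=$generation"
```

If the layout guess is right, a checkpoint whose generation points at a .tlog file that no longer exists is exactly the state described in this thread.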

What I have tried so far:

  • Upgrade to 2.1.1
  • Remove the translog files, fully or partially (ES was stopped while I fiddled with the files)

Nothing helped.

How can I force ES to start with whatever is in the index and begin the transaction log from scratch?

The index data itself seems to be fine. Please don't tell me that one unclean shutdown can kill the whole index.

I've checked the index itself using the following: http://pastebin.com/raw/LnBDtJXP

Thank you in advance!

That's very peculiar and shouldn't happen (famous last words). Any chance you can share the logs of that node for the period of the unclean shutdown? We might see something there.

I had to remove them all because this message bloats the logs quite rapidly; they grew to over 20GB in just a few hours.
Had I been thinking, I would have saved them aside. Hopefully I'll remember to do so next time.

But the question remains: cause aside, is there a way to recover? Again, the index data seems to be there...

P.S. The fact that in this scenario ES bloats the logs like crazy seems like another bug to me.

Another idea: the index is configured to have one replica. If I fire up another node (it's a single-node cluster currently) and join it to the cluster, is there a way to assign the replica there and promote it to primary?

Would you be able to send me the translog.ckp and translog-43.ckp files?

I also wonder what you mean by "after unclean shutdown"; can you elaborate?

When did you get this exception for the first time? Was it after restarting?

I've sent you a link to the translog files in a private message.

By unclean shutdown I mean:

I hit the power button on my desktop, which triggered an ACPI shutdown, and waited until the computer shut off. I relocated it to another room, pressed the power button, and went home. Several hours later I tried to connect to Kibana and discovered that its index is unavailable and that ES had produced over 20GB of logs. I checked the last log file, saw the aforementioned exception, and started to investigate.

The kibana index was fully functional before the shutdown.

Thank you very much again for looking into this.

OK, I think I've found an issue that can cause this problem. I will update you on it in a bit.

@haizzar thanks for bearing with me and sending me all the info. I found the problem and we will have a fix for it committed shortly here: https://github.com/elastic/elasticsearch/pull/15788

I checked your ckp files, and the one that refers to the missing translog-44.tlog file indeed has 0 operations recorded. Since this is transactional, what you can do is rename translog-43.ckp to translog.ckp and start up the node. This should recover it for you just fine. Please let me know if that worked for you.
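
For anyone landing here in the same state, a minimal sketch of that fix, simulated in a temp directory standing in for /var/lib/elasticsearch/zaar/nodes/0/indices/.kibana/0/translog (on a real node: stop Elasticsearch first, and keep a copy of the original translog.ckp before touching anything; the file contents below are placeholders):

```shell
# Simulated translog directory; only touch the real one with Elasticsearch stopped.
TLOG=$(mktemp -d)
printf 'gen-44-checkpoint' > "$TLOG/translog.ckp"     # points at the missing translog-44.tlog
printf 'gen-43-checkpoint' > "$TLOG/translog-43.ckp"  # last checkpoint whose .tlog still exists
printf 'gen-43-operations' > "$TLOG/translog-43.tlog"

cp "$TLOG/translog.ckp" "$TLOG/translog.ckp.bak"      # safety copy of the broken checkpoint
cp "$TLOG/translog-43.ckp" "$TLOG/translog.ckp"       # roll back to the last good checkpoint
cat "$TLOG/translog.ckp"
```

This works here because the missing-generation checkpoint recorded 0 operations, so rolling back to generation 43 loses nothing; if the broken checkpoint had recorded operations, those would be discarded.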


@s1monw thank you so much for the solution! It worked perfectly. The index recovered and is working OK. I'll make sure to remember this for the next time it (should not :slightly_smiling:) happens.

P.S. Apologies for the late response - I was on vacation.

I've had a corrupted translog a couple of times now after power failures. There seems to be no way of recovering from it; ES (2.3, 2.3.1, 2.3.2) just keeps trying, and eventually I give up and have to completely delete the index. Not a nice situation. I mean, it could be worse (if ES insisted on the index being there, which it fortunately doesn't), but really it should start up without a translog instead of going into an endless loop.

Stefan


Worked for me. Thanks a bunch :slight_smile: