Failed shard recovery after hard shutdown

kamal · December 15, 2018, 12:18pm

Hi
I was running a huge indexing job (about 400 billion records) when one of the nodes suddenly went down because of a power failure. After the power was restored, one shard is missing, and the error is:

shard failure, reason [failed to recover from translog], failure EngineException, nested: EOFException[read past EOF. pos [4590678] length: [4] end: [4590678].
cannot allocate because allocation is not permitted to any of nodes that hold an in-sync shard copy.

Because the indexing job is so large (and it is still running very slowly since the problem), the index has no replicas.
What should I do?

DavidTurner · December 15, 2018, 3:49pm

The translog was not properly written to disk before the power outage and is now corrupt. Did you set index.translog.durability: async? If not, my guess is that your storage hardware does not properly support the fsync() call, claiming to have persisted some writes before actually having done so.
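One way to confirm whether the durability setting was ever changed is to inspect the output of GET /<index>/_settings. The following is a minimal Python sketch, assuming the usual nested (non-flat) response shape and assuming that an absent setting means the default value `request` applies. The function name and the sample index name `my-index` are hypothetical, and the sample responses are made up for illustration.

```python
def translog_durability(settings_response, index_name):
    """Return the translog durability for index_name from a parsed
    GET /<index>/_settings response (nested, non-flat form).

    Assumption: when the setting is absent, the default "request" applies.
    """
    index_settings = settings_response[index_name]["settings"].get("index", {})
    return index_settings.get("translog", {}).get("durability", "request")


# Hypothetical sample responses, not taken from the cluster in this thread.
explicit = {"my-index": {"settings": {"index": {"translog": {"durability": "async"}}}}}
default = {"my-index": {"settings": {"index": {"number_of_replicas": "0"}}}}

print(translog_durability(explicit, "my-index"))
print(translog_durability(default, "my-index"))
```

If this reports `request`, every indexing request should have been fsynced before it was acknowledged, which points the investigation at the storage layer instead.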

The shard in question is broken, and the only truly reliable way forwards is to start again. You can wipe out the corrupt translog using the elasticsearch-translog tool (or elasticsearch-shard if in 6.5 or later) which will lose any writes that were not also written to Lucene. There's no way to tell which writes will be lost, unless you can somehow compare the data in Elasticsearch to your source data and fix it up.
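If the source data is still available, comparing document IDs is one way to find the lost writes. The sketch below is only an illustration under stated assumptions: both sides can be exported as ID streams sorted in ascending order (for example from the source system and from a scroll over the index), and the function name `missing_ids` is hypothetical. It streams through both inputs, so neither needs to fit in memory.

```python
def missing_ids(source_ids, indexed_ids):
    """Yield IDs present in source_ids but absent from indexed_ids.

    Both inputs must be iterables sorted in ascending order; they are
    consumed lazily, so neither has to fit in memory.
    """
    indexed = iter(indexed_ids)
    sentinel = object()
    current = next(indexed, sentinel)
    for sid in source_ids:
        # Advance the indexed stream until it catches up with sid.
        while current is not sentinel and current < sid:
            current = next(indexed, sentinel)
        if current is sentinel or current != sid:
            yield sid


# Made-up sample data: documents 3 and 5 never reached the index.
print(list(missing_ids([1, 2, 3, 4, 5], [1, 2, 4])))
```

The IDs it yields are the documents that would need to be re-indexed from the source after the corrupt translog has been removed.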

kamal · December 18, 2018, 7:24am

I didn't set index.translog.durability: async. The filesystem is ext4 and the disks are in RAID 10.
What should I do so that this does not happen again?

DavidTurner · December 18, 2018, 8:07am

As I said, my guess is that your storage hardware does not properly support the fsync() call. This is often due to a misconfiguration: write caching is sometimes enabled for performance reasons but this breaks fsync() unless all such caches are battery-backed. A simple way to check for this kind of problem is described in this article.
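As a rough first look before running a proper test, one can measure how many fsync'd writes per second the data disk reports. This is not the method from the linked article, only a heuristic sketch: the function name `fsyncs_per_second` and its parameters are hypothetical, and the temporary directory used in the example is a placeholder that should be replaced with a directory on the Elasticsearch data volume. A spinning disk that really waits for each write to reach the platter is limited by its rotation rate (a 7200 rpm disk completes 120 rotations per second), so a far higher figure suggests that some cache is acknowledging the writes.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, count=200, block=b"x" * 512):
    """Write `count` small blocks to a temporary file in `directory`,
    calling fsync() after each one, and return the observed rate of
    fsync'd writes per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return count / elapsed if elapsed > 0 else float("inf")


# Placeholder directory: point this at the data volume instead.
rate = fsyncs_per_second(tempfile.gettempdir())
print(f"{rate:.0f} fsync'd writes per second")
```

A high rate proves nothing on its own, because SSDs and battery-backed controller caches are legitimately fast. Only a test that cuts power mid-write and then verifies what survived can show whether acknowledged writes are durable.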

