Failed shard recovery after hard shutdown

I was running a very large indexing job (about 400 billion records) when one of the nodes went down because of a power failure. After power was restored, one shard is missing, and the error is:

  • shard failure, reason [failed to recover from translog], failure EngineException, nested: EOFException[read past EOF. pos [4590678] length: [4] end: [4590678].
  • cannot allocate because allocation is not permitted to any of nodes that hold an in-sync shard copy.

Because the indexing job is so large (and it is still running very slowly since the problem), the index has no replicas.
What should I do?

The translog was not properly written to disk before the power outage and is now corrupt. Did you set index.translog.durability: async? If not, my guess is that your storage hardware does not properly support the fsync() call, claiming to have persisted some writes before actually having done so.
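If you are unsure what the index is actually using, the value can be read from the response of GET /<index>/_settings. Here is a minimal Python sketch, assuming the usual nested response shape and that an unset value means the default of request; the index name my-index and the helper name are made up for illustration.

```python
def translog_durability(settings_response: dict, index_name: str) -> str:
    """Return the translog durability for an index from a parsed
    GET /<index>/_settings response (hypothetical helper).

    When the setting is absent, the default "request" is assumed,
    meaning the translog is fsynced before each request is acknowledged.
    """
    index_settings = (
        settings_response.get(index_name, {})
        .get("settings", {})
        .get("index", {})
    )
    return index_settings.get("translog", {}).get("durability", "request")


# Example response body with the setting explicitly set (illustrative data).
example = {
    "my-index": {"settings": {"index": {"translog": {"durability": "async"}}}}
}
print(translog_durability(example, "my-index"))
```

If this reports async, acknowledged writes can be lost on a crash even on well-behaved hardware, so the storage is not necessarily at fault.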

The shard in question is broken, and the only truly reliable way forward is to start again. You can wipe out the corrupt translog using the elasticsearch-translog tool (or elasticsearch-shard in 6.5 or later), but this loses any writes that were not also written to Lucene. There's no way to tell which writes will be lost unless you can somehow compare the data in Elasticsearch to your source data and fix it up.
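With elasticsearch-shard, once the corrupt data has been removed, the shard typically still has to be allocated explicitly through the cluster reroute API using the allocate_stale_primary command. Here is a rough Python sketch that only builds the request body for POST /_cluster/reroute; the index name, shard number and node name are placeholders, so check them (and the exact procedure for your version) before sending anything.

```python
import json


def stale_primary_reroute_body(index: str, shard: int, node: str) -> dict:
    """Build a body for POST /_cluster/reroute that allocates a stale
    primary (hypothetical helper; the arguments are placeholders).

    accept_data_loss must be true: it acknowledges that writes missing
    from this shard copy are gone for good.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Placeholder values, not taken from the cluster in this thread.
print(json.dumps(stale_primary_reroute_body("my-index", 0, "node-1"), indent=2))
```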

I didn't set index.translog.durability: async. The filesystem is ext4 and the disks are in RAID 10.
What should I do so that this does not happen again?

As I said, my guess is that your storage hardware does not properly support the fsync() call. This is often due to a misconfiguration: write caching is sometimes enabled for performance reasons but this breaks fsync() unless all such caches are battery-backed. A simple way to check for this kind of problem is described in this article.
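The general idea behind such a check can be sketched in Python: write numbered records, call fsync() after each one, and treat a record as acknowledged only once fsync() has returned. This is only an illustration under my own assumptions (the file layout and function names are invented, and it is not the method from the article). A real test has to cut the power mid-run and keep the acknowledgement log on a different machine; if fewer records survive than were acknowledged, something in the storage stack is lying about fsync().

```python
import os
import struct

# Each record is one little-endian 64-bit sequence number.
RECORD = struct.Struct("<Q")


def write_acknowledged(path: str, count: int) -> int:
    """Write `count` sequence numbers, fsyncing after each one.

    Returns how many records were acknowledged, i.e. for how many
    records fsync() returned successfully.
    """
    acknowledged = 0
    with open(path, "wb") as f:
        for seq in range(count):
            f.write(RECORD.pack(seq))
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to persist it to disk
            acknowledged += 1      # in a real test, report this elsewhere
    return acknowledged


def count_intact_records(path: str) -> int:
    """Count leading records whose sequence numbers are in order."""
    intact = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file or a torn final record
            if RECORD.unpack(chunk)[0] != intact:
                break  # unexpected content: stop counting
            intact += 1
    return intact
```

After a power cut, count_intact_records() on the surviving file should be at least the last acknowledged count that was recorded off the machine.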
