I was running a huge indexing job (about 400 billion records) when one of the nodes suddenly went down because of a power failure. After the power was restored, one shard is missing, and the errors are:
- shard failure, reason [failed to recover from translog], failure EngineException, nested: EOFException[read past EOF. pos  length:  end: .
- cannot allocate because allocation is not permitted to any of nodes that hold an in-sync shard copy.
Because the indexing job is so large (and is still running very slowly after the problem), the index has no replicas.
What should I do?
The translog was not properly written to disk before the power outage and is now corrupt. Did you set index.translog.durability: async? If not, my guess is that your storage hardware does not properly support the fsync() call, claiming to have persisted some writes before actually having done so.
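To confirm which durability mode an index is actually using, you can ask the index settings API with include_defaults=true. The following is only a minimal sketch: the helper names, the base URL, and the index name are hypothetical, and it assumes the response is in the default nested (not flat_settings) form.

```python
import json
import urllib.request


def translog_durability(settings_response, index):
    """Return the effective index.translog.durability for `index`.

    `settings_response` is the parsed JSON from
    GET /<index>/_settings?include_defaults=true (nested form assumed).
    """
    entry = settings_response[index]
    # An explicitly configured value lives under "settings";
    # otherwise the value in use is reported under "defaults".
    for section in ("settings", "defaults"):
        value = (
            entry.get(section, {})
            .get("index", {})
            .get("translog", {})
            .get("durability")
        )
        if value is not None:
            return value
    # "request" (fsync before acknowledging each request) is the documented default.
    return "request"


def fetch_settings(base_url, index):
    """Fetch index settings from a cluster (hypothetical helper, needs a live node)."""
    url = f"{base_url}/{index}/_settings?include_defaults=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

For example, translog_durability(fetch_settings("http://localhost:9200", "my_index"), "my_index") should report either request or async; only async permits acknowledged writes to be lost without a hardware fault.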
The shard in question is broken, and the only truly reliable way forward is to start again. You can wipe out the corrupt translog using the elasticsearch-translog tool (or elasticsearch-shard if you are on 6.5 or later), but this will lose any writes that were not also written to Lucene. There's no way to tell which writes will be lost, unless you can somehow compare the data in Elasticsearch to your source data and fix it up.
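For 6.5 or later, the recovery is roughly two steps: run elasticsearch-shard remove-corrupted-data against the broken shard while the node is stopped, then, once the node is back up, tell the cluster to accept the stale copy as primary via the reroute API. The sketch below only builds the command line and the request body so you can review them before running anything. The function names, index name, and node name are hypothetical, and you should check the tool's own output for your version, since it prints the reroute command it expects.

```python
def shard_tool_command(index, shard_id):
    """Build the argv for removing the corrupted part of one shard.

    Run it on the node that holds the shard, with Elasticsearch stopped.
    """
    return [
        "bin/elasticsearch-shard",
        "remove-corrupted-data",
        "--index", index,
        "--shard-id", str(shard_id),
    ]


def stale_primary_reroute(index, shard_id, node):
    """Build the body for POST /_cluster/reroute after the tool has run.

    accept_data_loss must be true: writes that never reached Lucene are gone.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard_id,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


if __name__ == "__main__":
    import json

    # "my_index", shard 0 and "node-1" are placeholders, not values from this thread.
    print(" ".join(shard_tool_command("my_index", 0)))
    print(json.dumps(stale_primary_reroute("my_index", 0, "node-1"), indent=2))
```

Keeping the two steps as data rather than executing them directly is deliberate: both are destructive, so it is worth reading them over first.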
I didn't set index.translog.durability: async. The filesystem is ext4 and the disks are in RAID 10.
So what should I do to prevent this from happening again?
As I said, my guess is that your storage hardware does not properly support the fsync() call. This is often due to a misconfiguration: write caching is sometimes enabled for performance reasons but this breaks fsync() unless all such caches are battery-backed. A simple way to check for this kind of problem is described in this article.
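As a quick first look (this is not the method from the linked article, and it proves nothing on its own), you can time how many fsync() calls per second the data disk completes. The sketch below is a heuristic under stated assumptions: the function name and the iteration count are made up for illustration. On spinning disks without a battery-backed cache, a durable fsync has to wait on the platter, so a rate in the thousands per second is a hint that some volatile write cache is acknowledging writes early. Only a real power-cut test is conclusive.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, iterations=200):
    """Measure how many small write+fsync cycles per second `directory` sustains."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)
            os.fsync(fd)  # ask the OS and the device to persist the write
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    # Guard against a zero interval on very coarse clocks.
    return iterations / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Point this at the Elasticsearch data path; the temp dir is only a stand-in.
    rate = fsyncs_per_second(tempfile.gettempdir())
    print(f"{rate:.0f} fsyncs/second")
```

Run it against the same filesystem that holds the Elasticsearch data directory; a result from a different disk or a tmpfs mount says nothing about the storage in question.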