Cannot get failed shard back online

Hi, I'm trying to get a failed shard in a single-node cluster back online. The cluster is a single Docker container running the docker.elastic.co/elasticsearch/elasticsearch:7.9.2 image. It has 10 indices, and at some point this afternoon one index went red with the following error:

nested: IOException[failed engine (reason: [refresh failed source[write indexing buffer]])];    
nested: CorruptIndexException[checksum status indeterminate: unexpected exception (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")))];
nested: IOException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")];
nested: EOFException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820]; 
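
For anyone else debugging this: the red index shows up when filtering the cat indices API on health, e.g.:

GET _cat/indices?v&health=red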

Searching online, I found that I should give this command a try:

/usr/share/elasticsearch/jdk/bin/java -cp /usr/share/elasticsearch/lib/lucene-core-8.6.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/

This seems to check 25 segment files (?), which takes about 15 minutes, but it then reports that everything is fine.
I then tried calling this endpoint:

POST _cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "dossiers-en",
        "shard": 0,
        "node": "elastic-search-7cb7cf9bf8-dhwmn",
        "accept_data_loss": true
      }
    }
  ]
}

But this gives me the same error I started with. Restarting the whole container also results in the same error.
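
For reference, allocate_empty_primary discards whatever is left in the shard. If a (possibly stale) copy still exists on disk, allocate_stale_primary is meant to be the less destructive variant, since it re-uses that copy instead of starting empty:

POST _cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "dossiers-en",
        "shard": 0,
        "node": "elastic-search-7cb7cf9bf8-dhwmn",
        "accept_data_loss": true
      }
    }
  ]
}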

Any tips on getting this index back up and running, preferably with no or minimal data loss?

What's the output from _cat/allocation?v and an allocation explain for the shard?
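
That is, something along these lines:

GET _cat/allocation?v

GET _cluster/allocation/explain
{
  "index": "dossiers-en",
  "shard": 0,
  "primary": true
}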

Also, I'd suggest upgrading: 7.13 is the latest and you're a few versions behind :slight_smile:

I was able to fix this by running elasticsearch-shard remove-corrupt-data --dir <index-location> on the faulty shard. The root cause was most likely poor Kubernetes resource settings, which caused the single Elasticsearch pod to be evicted under memory and/or CPU pressure. I've corrected this and the problem hasn't returned since.
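
A note on the tool: Elasticsearch must be stopped before you run it, and it removes whatever corrupted data it finds, so some data loss is expected. In this setup the invocation was along these lines (the --dir path is the shard's index directory from the stack trace above):

# Run with the node stopped; the tool drops corrupted segments/translog data.
bin/elasticsearch-shard remove-corrupt-data \
  --dir /usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/

As for the resource settings: the fix was giving the pod explicit requests and limits so the kubelet stops evicting it under pressure. A sketch with illustrative values (not the actual numbers from my cluster), keeping the JVM heap at roughly half the container's memory limit per Elastic's guidance:

containers:
- name: elasticsearch
  image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
  env:
  - name: ES_JAVA_OPTS
    value: "-Xms1g -Xmx1g"  # heap ~50% of the memory limit
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      memory: "2Gi"  # memory request == limit lowers eviction risk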

We're in the process of replacing this single node with a more robust Kubernetes deployment using the 'Elastic Cloud on Kubernetes' (ECK) operator and its custom resources (link), which should fix this problem for good.
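
For reference, a minimal ECK manifest is roughly the quickstart shape below (name, version, and node count are placeholders, not our actual deployment):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false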
