Cannot get failed shard back online

Hi, I'm trying to get a failed shard in a single-node cluster back online. The cluster is a single Docker container running the docker.elastic.co/elasticsearch/elasticsearch:7.9.2 image. It has 10 indices, and at some point this afternoon one index went red with the following error:

nested: IOException[failed engine (reason: [refresh failed source[write indexing buffer]])];    
nested: CorruptIndexException[checksum status indeterminate: unexpected exception (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")))];
nested: IOException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")];
nested: EOFException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820]; 
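
For anyone else debugging this: the red index shows up when filtering the cat indices API on health, e.g.:

GET _cat/indices?v&health=red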

Searching online, I found that I should give this command a try:

/usr/share/elasticsearch/jdk/bin/java -cp /usr/share/elasticsearch/lib/lucene-core-8.6.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/

This seems to check 25 segment files (?), which takes about 15 minutes, but it then reports that everything is fine.
I then tried calling this endpoint:

POST _cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "dossiers-en",
        "shard": 0,
        "node": "elastic-search-7cb7cf9bf8-dhwmn",
        "accept_data_loss": true
      }
    }
  ]
}

But this gives me the same error I started with. Restarting the whole container also results in the same error.
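
For reference, allocate_empty_primary discards whatever is left in the shard. If a (possibly stale) copy still exists on disk, allocate_stale_primary is meant to be the less destructive variant, since it re-uses that copy instead of starting empty:

POST _cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "dossiers-en",
        "shard": 0,
        "node": "elastic-search-7cb7cf9bf8-dhwmn",
        "accept_data_loss": true
      }
    }
  ]
}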

Any tips on getting this index back up and running, preferably with no or minimal data loss?

What's the output from _cat/allocation?v and an allocation explain for the shard?
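
That is, something along these lines:

GET _cat/allocation?v

GET _cluster/allocation/explain
{
  "index": "dossiers-en",
  "shard": 0,
  "primary": true
}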

Also, I'd suggest upgrading: 7.13 is the latest and you're a few versions behind :slight_smile:

I was able to fix this by running elasticsearch-shard remove-corrupt-data --dir <index-location> on the faulty shard. The root cause was most likely poor Kubernetes resource settings, which caused the single Elasticsearch pod to be evicted under memory and/or CPU pressure. I've corrected this and the problem hasn't returned since.
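
A note on the tool: Elasticsearch must be stopped before you run it, and it removes whatever corrupted data it finds, so some data loss is expected. In this setup the invocation was along these lines (the --dir path is the shard's index directory from the stack trace above):

# Run with the node stopped; the tool drops corrupted segments/translog data.
bin/elasticsearch-shard remove-corrupt-data \
  --dir /usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/

As for the resource settings: the fix was giving the pod explicit requests and limits so the kubelet stops evicting it under pressure. A sketch with illustrative values (not the actual numbers from my cluster), keeping the JVM heap at roughly half the container's memory limit per Elastic's guidance:

containers:
- name: elasticsearch
  image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
  env:
  - name: ES_JAVA_OPTS
    value: "-Xms1g -Xmx1g"  # heap ~50% of the memory limit
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      memory: "2Gi"  # memory request == limit lowers eviction risk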

We're in the process of replacing this single node with a more robust Kubernetes deployment using the 'Elastic Cloud on Kubernetes' (ECK) operator and its custom resources (link), which should fix this problem for good.
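
For reference, a minimal ECK manifest is roughly the quickstart shape below (name, version, and node count are placeholders, not our actual deployment):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false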
