Hi, I'm trying to get a failed shard in a single-node cluster back online. The cluster is a single Docker container running the docker.elastic.co/elasticsearch/elasticsearch:7.9.2 image.
It has 10 indices, and at some point this afternoon one index went red with the following error:
nested: IOException[failed engine (reason: [refresh failed source[write indexing buffer]])];
nested: CorruptIndexException[checksum status indeterminate: unexpected exception (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")))];
nested: IOException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm")];
nested: EOFException[read past EOF: NIOFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/_zu_1.fnm") buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 6820];
Searching online, I found that I should give this command a try:
/usr/share/elasticsearch/jdk/bin/java -cp /usr/share/elasticsearch/lib/lucene-core-8.6.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/xFpiL2YWSzOCBE7eNrGomQ/0/index/
This seems to check 25 segment files (?), which takes about 15 minutes, but then reports that everything is fine.
I tried calling this endpoint:
POST _cluster/reroute?master_timeout=5m
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "dossiers-en",
        "shard": 0,
        "node": "elastic-search-7cb7cf9bf8-dhwmn",
        "accept_data_loss": true
      }
    }
  ]
}
But this gives me the same error I started with. Restarting the whole container also results in the same error.
Any tips on getting this index back up and running, preferably with no or minimal data loss?