Elasticsearch Node went down abruptly and lost data

I am running an Elasticsearch cluster on version 7.10.0.
It has been running for a year now, and I never faced any issues.
Our setup comprises 1 primary and 1 replica shard, placed in different availability zones; distribution is done through the rack_id allocation-awareness attribute.
Recently the cluster went into a Red state and we lost data for 2 indices out of 500+.
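For context, rack/zone-aware replica placement like this is normally set up by giving each node a `node.attr.rack_id` in its `elasticsearch.yml` and enabling allocation awareness on that attribute. A minimal sketch (the endpoint `localhost:9200` and rack names are placeholders):

```shell
# Each node's elasticsearch.yml declares its rack, e.g.:
#   node.attr.rack_id: rack_one
# Then allocation awareness is enabled cluster-wide, so a primary and its
# replica are never allocated to nodes with the same rack_id:
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack_id"
  }
}'
```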

When I tried to debug the reason, I checked the master's logs and found a FailedNodeException, which is shared in the gist below:

On further investigation of the logs on that node, I found the following exception logs:

The surprising part is: even if one of my nodes failed, the replica should have been there, so why did the index go to Red state and fail to recover after that?
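While the cluster is red, the allocation explain API usually reports why a given shard cannot be assigned, which helps answer exactly this question. A sketch, assuming a local cluster on port 9200 and a placeholder index name:

```shell
# Ask Elasticsearch why a specific shard copy is unassigned.
# "my-index" and shard 0 are placeholders; primary=true asks about
# the primary copy (use false for the replica).
curl -X GET "localhost:9200/_cluster/allocation/explain" \
  -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}'
```

The response includes an `unassigned_info.reason` and per-node decisions explaining why each node refused the shard.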

Also, after closely examining every log on the machine, I found that more than just those 2 indices had corrupted shards, but the others recovered on their own.

Also, the failed node auto-recovered and rejoined the cluster within a few minutes.

Now I am not able to understand what should be checked for this scenario; even the corrupted file locations mentioned in the logs were no longer there by the time the node rejoined on its own.

I need help figuring out what could have happened, so we can see how to avoid it in the future.

See these docs for more information on troubleshooting a CorruptIndexException.

Thanks for sharing this doc. What I am not able to get here is: I had one replica as well, and let's say the shard got corrupted on one of the nodes because of some arbitrary issue; even then, it should be possible to promote the replica to primary and recover, is that correct?

In this case, I can see only one node went down (from the master's logs); why it didn't promote the replica to primary and recover those shards is something I couldn't understand.

That can happen if the replica is itself unhealthy when the primary fails.

@DavidTurner Yes, if the replica is unhealthy that can happen, but before and during the time the node went down, there were no logs on the master for any other node going down.
How do we confirm this?
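One way to check the replica's state is the cat shards API, plus a search of the master's logs around the incident. A sketch, with `my-index` and the log path as placeholders:

```shell
# Show shard copies for the affected index, including why any copy is
# unassigned (look for UNASSIGNED state and the unassigned.reason column):
curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node"

# Search the elected master's logs around the incident for shard-failure
# events; the master logs these when it marks a shard copy as failed:
grep -i "failed shard" /var/log/elasticsearch/*.log
```

Note that a replica can be marked stale or failed without its node going down, so the absence of node-left messages on the master doesn't by itself prove the replica was healthy.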

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.