I am running an Elasticsearch cluster on version 7.10.0.
It has been running for a year now and we never faced any issues.
Our setup comprises 1 primary and 1 replica for each shard, placed in different availability zones. Distribution is handled through the rack_id awareness attribute.
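For context, the zone awareness is configured roughly like this in elasticsearch.yml (the attribute name rack_id is as mentioned above; the attribute value is illustrative):

```yaml
# elasticsearch.yml (illustrative; actual attribute values differ per node)
node.attr.rack_id: zone-a
cluster.routing.allocation.awareness.attributes: rack_id
```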
Recently the cluster went into a Red state and we lost data for 2 indices out of 500+ indices.
When I tried to debug the reason, I checked the master's logs and found a FailedNodeException, which is shared in the gist below:
On investigating the logs on that node further, I found the following exception logs:
The surprising part is: even if one of my nodes failed, the replica should still have been there, so why did the index go into a Red state and fail to recover after that?
Also, after going through every log on the machine closely, I found that more than just those 2 indices had corrupted shards, but the other ones recovered on their own.
The failed node also recovered automatically and rejoined the cluster within a few minutes.
Now I am not able to work out what to check for this scenario; even the corrupted file locations mentioned in the logs were no longer there by the time the node rejoined on its own.
I need help figuring out what could have happened so we can see how to avoid it in the future.
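For reference, the fields I have been checking come from the `_cluster/allocation/explain` API. A minimal sketch of pulling out the relevant fields from its JSON response; the response body below is a shortened, made-up example shaped like the real API output, not the actual output from my cluster:

```python
import json

# Shortened, made-up _cluster/allocation/explain response for an
# unassigned primary shard (real responses carry many more fields).
explain_response = json.loads("""
{
  "index": "my-index",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "details": "failed shard on node [abc]: shard failure, reason [corrupted]"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt"
}
""")

# The key fields when a shard stays red:
#  - unassigned_info.reason: why the shard became unassigned in the first place
#  - can_allocate: whether the cluster currently sees any usable copy
reason = explain_response["unassigned_info"]["reason"]
verdict = explain_response["can_allocate"]
print(reason, verdict)
```

In my case I would expect to see something like `no_valid_shard_copy` if both the primary and the replica copies were considered corrupt or stale.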
Thanks for sharing this doc. What I am still not able to understand is: I had one replica as well, so even if the shard got corrupted on one of the nodes because of some arbitrary issue, the cluster should have been able to promote the replica to primary and recover, is that correct?
In this case I can see only one node went down (from the master's logs), so why it didn't promote the replica to primary and recover those shards is what I can't figure out.