Elasticsearch Node went down abruptly and lost data

nsoni · September 21, 2023, 5:07pm

I am running Elasticsearch cluster version 7.10.0
It's running for a year now, never faced any issues.
Our setup comprises 1 primary and 1 replica in a different availability zone. Distribution is through the rack_id attribute.
Recently one of the days I faced an issue the cluster went to Red state and we lost data for 2 indexes out of 500+ indexes.

When tried to debug the reason, I checked masters logs and found that one of FailedNodeException log was there which is shared in the below gist:

gist.github.com

https://gist.github.com/nsonic001/96748e1061de003c031812ddde1049c0

gistfile1.txt

[2023-09-11T14:17:08,803][WARN ][o.e.g.G.InternalReplicaShardAllocator] [datapoints_default_cs2-m5xl-master4] [pleng_02m_202337][1]: failed to list shard for shard_store on node [0-g6CNnBRWW_r0t_1P9RNQ]
org.elasticsearch.action.FailedNodeException: Failed node [0-g6CNnBRWW_r0t_1P9RNQ]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:256) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:177) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:231) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:640) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:277) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) [elasticsearch-7.10.0.jar:7.10.0]

This file has been truncated. show original

On further investigating logs into this node, found the following exception logs

gist.github.com

https://gist.github.com/nsonic001/d8370ba6c9cc55576356991287337536

gistfile1.txt

[2023-09-11T12:42:37,095][WARN ][o.e.i.e.Engine           ] [datapoints_default_cs2-i3en2xl_1a-data_hot2] [customers_02m_202337][0] failed engine [search execution corruption failure]
org.elasticsearch.search.fetch.FetchPhaseExecutionException: Fetch Failed [Error running fetch phase for doc [1473264]]
        at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:171) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.SearchService.lambda$executeFetchPhase$3(SearchService.java:589) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]

This file has been truncated. show original

The surprising part of if one of my node even failed, replica should have been there, then why index went to red state and it wasn't able to recover after that.

Also, after closely seeing each and every logs in the machine found that there more than just 2 indexes shards corrupted but other one got recovered on it's own.

Also, the FailedNode was auto recovered and joined the cluster back in few minutes itself.

Now, I am not able to understand what could be checked for this scenario, even the corrupted file location in the logs we see weren't there by the time node joined back on it's own.

Need help in figuring out what could have happened so can see how we can avoid in future.

DavidTurner · September 21, 2023, 6:16pm

See these docs for more information on troubleshooting a CorruptIndexException.

nsoni · September 22, 2023, 8:51am

Thanks for sharing this doc, What I am not able to get here is I had one replica as well and let's say the shard corrupted on one of the nodes because of some arbitrary issue even then it should be able to promote the replica to primary and recover, is that correct?

In this case, I can see only one node went down(from the master's logs), why it didn't promote the replica to primary and recover those shards is something I couldn't get it.

DavidTurner · September 22, 2023, 12:25pm

That can happen if the replica is itself unhealthy when the primary fails.

nsoni · September 25, 2023, 5:16am

@DavidTurner Yes If replica is unhealthy it can happen, but before and during the time the node went down, there were no logs for any other node down on master.
How do we confirm this?

system · October 23, 2023, 5:16am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Recover shard failed Elasticsearch	1	1563	November 16, 2017
Elasticsearch 1.4 node loss discovery Elasticsearch	2	830	July 5, 2017
Not able to index data if one ES goes down in cluster Elasticsearch	5	405	July 6, 2017
Data node constantly dropping out of the cluster Elasticsearch	19	7182	October 2, 2019
Getting multiple exception on restart of node - org.elasticsearch.action.FailedNodeException Elasticsearch	2	2129	March 29, 2018

Elasticsearch Node went down abruptly and lost data

Related topics