We are running version 6.x, which relies on the minimum master nodes setting for quorum.
The cluster has three master-eligible nodes (1, 2, 3), and the active master was (1). One master-eligible node (3) left the cluster due to an unknown error and rejoined a few days later when it was restarted manually. At that point, node (3) had stale cluster state, and I presume it would gradually sync its internal state from the active master (1). However, at that moment the active master (1) left the cluster and (3) became the active master. This left dangling references to shards created in the past few days (while 3 was down) and led to data loss. My question is: why did Elasticsearch elect (3) as the active master while it still had stale state and was syncing from (1)? Does it provide any protection in this scenario and prefer (2) over (3) during master election?
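For context, this is roughly how the quorum setting in question is applied in 6.x; a sketch only, where `localhost:9200` is a placeholder for one of our nodes:

```shell
# 6.x quorum control: discovery.zen.minimum_master_nodes is a dynamic cluster
# setting (it can also live in elasticsearch.yml). With 3 master-eligible
# nodes, the correct value is floor(3 / 2) + 1 = 2.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"discovery.zen.minimum_master_nodes": 2}}'
```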
Welcome to our community!
First things first: 6.x is EOL and no longer supported, so you should be looking to upgrade as a matter of urgency.
Can you post the output from the `_cluster/stats?pretty&human` API for us to look at? It'll help provide more context on your cluster.
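For reference, that call looks like this (assuming the cluster is reachable on `localhost:9200`):

```shell
# Cluster-wide stats; ?pretty formats the JSON and &human renders
# byte counts and durations in readable units.
curl -s 'http://localhost:9200/_cluster/stats?pretty&human'
```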
As it's an internal cluster, I cannot post the details or the output of that call. Here are some stats that might be of help; please let me know if you need specific details:
60 data nodes
3 master eligible nodes
minimum master nodes = 2
400 indexes
2600 shards with 1:1 replication
16 billion documents
70,000 segments
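The master settings above are consistent with the usual quorum formula; a quick sketch:

```shell
# Quorum for N master-eligible nodes is floor(N / 2) + 1.
# With N=3 this gives 2, matching the minimum master nodes value above.
N=3
echo $(( N / 2 + 1 ))   # prints 2
```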
In general, what criteria does Elasticsearch use to elect an active master from the 2 or 3 eligible nodes? Does the staleness of a node's cluster state factor into this?
It does, but in 6.x and earlier it's not watertight. You should upgrade to a version that isn't EOL as a matter of urgency.
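If you want to watch which node holds the active master role while an event like this unfolds, the `_cat/master` API can be polled (again assuming `localhost:9200` as a placeholder address):

```shell
# Prints the node id, host, ip, and name of the currently elected master;
# ?v adds a header row.
curl -s 'http://localhost:9200/_cat/master?v'
```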
We are actively planning to upgrade to the latest version soon. Do you have a reference on this topic? It would help us improve our capacity planning against these failure modes.
What do you mean by a reference? The EOL docs already linked in this thread show that you're using an unacceptably old version. You're missing out on literally years of bugfixes and performance improvements.
My interest is in a reference to the latest docs that document/explain this behaviour (around master election) in more detail, which would also help explain why this scenario would have been avoided in current versions.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.