Different Cluster Behavior When Two Master-Eligible Nodes Are Stopped

I’m working with two different Elasticsearch clusters (version 8.17.5) and noticed different behavior when stopping master-eligible nodes. Here's the setup for both clusters:
Cluster A:
3 dedicated master nodes
3 dedicated data nodes
Cluster B:
3 dedicated master nodes
3 dedicated data nodes

In both clusters, I stop **2 out of 3** master-eligible nodes.

On Cluster A, the cluster remains functional — no issues are reported, and the remaining master-eligible node keeps the cluster running.
On Cluster B, the cluster becomes unavailable, with messages indicating the absence of a master.

This behavior confuses me because both clusters have the same architecture. According to quorum rules, I’d expect both clusters to require at least 2 master-eligible nodes to maintain quorum and elect a master.
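
(For reference, the quorum arithmetic as I understand it: a majority of an n-node voting configuration is floor(n/2) + 1, so with 3 voting nodes at least 2 must be available to elect a master, while a 1-node voting configuration needs only that single node.)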

My Questions:

  1. Why does Cluster A still function with only 1 master-eligible node running?

  2. Could this be due to differences in voting configuration, cluster state, or leftover voting exclusions?

  3. What’s the proper way to inspect and compare the master election configuration and coordination state between these clusters? (See the example requests after this list.)

  4. Is there a recommended way to ensure consistent and predictable master election behavior across multiple clusters?
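
For anyone reproducing this, a minimal set of requests for comparing the two clusters might look like the following (a sketch, assuming each cluster is reachable on localhost:9200; adjust the host and authentication for your environment):

  # Node roles and which node is the elected master (marked with * in the master column)
  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,node.role,master'

  # Committed voting configuration and any voting exclusions
  curl -s 'http://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination&pretty'

  # Overall cluster health
  curl -s 'http://localhost:9200/_cluster/health?pretty'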

If that is the case, I suspect your description of the cluster topology is incorrect. What is the output of the cat nodes API?

I checked it, and here is why this happened:

Master 1 is the elected master.
On master 2, I changed the node roles to coordinating-only and then restarted it.
On master 3, I changed the node roles to coordinating-only and then restarted it.

The cluster works correctly, and when we restart master 1, nothing happens and it keeps working as well.
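
A sketch of what each conversion step looked like (assuming package installs managed by systemd and a cluster reachable on localhost:9200; paths and the service manager may differ in your setup):

  # In /etc/elasticsearch/elasticsearch.yml on the node being converted,
  # an empty role list makes the node coordinating-only:
  #   node.roles: [ ]

  # Restart the node so the new roles take effect
  sudo systemctl restart elasticsearch

  # Confirm the cluster now reports the node as coordinating-only ('-' in node.role)
  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master'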

I am not sure I follow. Please provide the exact sequence of steps you performed and the output of the API I linked to, ideally at each stage if you can reproduce it.

10.58.5.3 16 89 0 0.08 0.08 0.09 - - as-se-stg-ees-master-3
10.58.5.2 14 89 0 0.06 0.06 0.08 - - as-se-stg-ees-master-2
10.58.5.1 60 98 3 0.66 0.21 0.12 m * as-se-stg-ees-master-1
10.58.5.6 63 98 6 0.32 0.31 0.36 d - as-se-stg-ees-data-3
10.58.5.4 56 95 17 0.48 0.42 0.37 d - as-se-stg-ees-data-1
10.58.5.5 17 94 8 0.18 0.31 0.34 d - as-se-stg-ees-data-2
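
(This looks like the default _cat/nodes output without the ?v header row; the columns are ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master and name, so the '-' under node.role for master-2 and master-3 shows they are now coordinating-only, and the '*' marks master-1 as the elected master.)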

When we have 3 master-eligible nodes and we stop two of them at once, the cluster becomes unavailable, which is expected due to lack of quorum.

However, if we instead change the roles of two of the master nodes to coordinating-only, one at a time, and restart them, the cluster continues to function with only one master-eligible node remaining.

So in this case, even though we end up with only one master-eligible node, the cluster still works — unlike the first scenario where stopping two masters caused the cluster to become unavailable.

Why do you think this is not correct?

A 6-node cluster with 1 master-eligible node is a valid topology. It might not be wise, but it is valid.

That is all expected, as you reconfigured the cluster before shutting down the nodes that used to be master-eligible. I do not understand what the problem is.

I thought that if a cluster is initialized with 3 master-eligible nodes, it would always require at least 2 of them to be available in order to stay functional.

You changed the cluster to only have one master-eligible node, which means that at that point it would behave exactly as if it had been configured that way from the start.

Thanks for your reply.

This is the key difference: you're bringing the nodes back into the cluster again, so Elasticsearch can tell that they're no longer master-eligible, which means it's safe to reconfigure the cluster to ignore their votes. If you just shut them down and don't start them up again then it cannot do that.
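
If you want to see this happening, you can poll the coordination metadata while the nodes are converted (a sketch, assuming the cluster is reachable on localhost:9200): the committed voting configuration starts out with three node IDs and, once the two former masters have rejoined as coordinating-only nodes, should shrink to the single remaining master-eligible node.

  # The committed voting configuration is a list of node IDs; electing a master
  # requires a majority of these nodes
  curl -s 'http://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination.last_committed_config&pretty'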
