How does cluster.auto_shrink_voting_configuration prevent split brain?

etki · September 6, 2023, 9:51am

The docs say that split brain is not possible, but they state this without much explanation. How is the following situation avoided?

A cluster has voting configuration of 5 nodes.
A network partition occurs, cutting off 1 node.
After some time, auto-shrink kicks in and reduces the master eligible set to 4 nodes. The quorum is 2 nodes.
The network partition grows greedy and splits the cluster in half.
Now we have two partitions with quorums, each running on its own.

Christian_Dahlqvist · September 6, 2023, 9:52am

The quorum with 4 master eligible nodes is 3, not 2.
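For reference, the arithmetic here is a strict majority: more than half of the voting nodes must agree. A minimal sketch, where `quorum` is a hypothetical helper name for illustration and not an Elasticsearch API:

```python
def quorum(voting_nodes: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    return voting_nodes // 2 + 1


# Compare the sizes discussed in this thread.
for n in (3, 4, 5):
    print(f"{n} voting nodes -> quorum of {quorum(n)}")
```

With 4 voting nodes the majority is 3, so two halves of 2 nodes each both fall short.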

etki · September 6, 2023, 9:53am

My bad. Need more sleep.

DavidTurner · September 6, 2023, 10:00am

How does cluster.auto_shrink_voting_configuration prevent split brain?

To be clear, the cluster.auto_shrink_voting_configuration setting has nothing to do with preventing split-brain situations. See these docs:

No matter how it is configured, Elasticsearch will not suffer from a "split-brain" inconsistency. The cluster.auto_shrink_voting_configuration setting affects only its availability in the event of the failure of some of its nodes and the administrative tasks that must be performed as nodes join and leave the cluster.

etki · September 6, 2023, 10:10am

Those are exactly the docs I'm referring to: they make claims but provide little insight. The reason I'm here is that cluster resizing is a very tricky thing, and one can't just declare such guarantees without an explanation. Split brain is very much related to dynamic cluster resizing, since resizing is a direct way to end up with a subset of nodes smaller than half of the original cluster. The example I gave is trivial, but network partitions in general are not. The absence of an example showing a split-brain scenario doesn't mean such a scenario doesn't exist, and as everyone knows, very fun things can happen.

Christian_Dahlqvist · September 6, 2023, 10:16am

In the scenario you described you would not end up with a split brain. When you lose the first node you still require 3 master-eligible nodes to elect a master, irrespective of whether the cluster thinks it has 4 or 5 master-eligible nodes. If that cluster is then split evenly in half (2+2 master-eligible nodes) by a network partition, you end up with 2 red clusters that cannot elect a master. Only if the initial node comes back or the network partition is resolved will some part of the cluster again be able to elect a master.
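The scenario above can be checked with a small model. This is only a sketch of the majority rule as described in this post, not Elasticsearch's actual election code. The node names `n1` to `n5` and the helper `can_elect_master` are made up for illustration.

```python
def can_elect_master(partition: set, voting_config: set) -> bool:
    """True if the partition holds a strict majority of the voting configuration."""
    votes = len(partition & voting_config)
    return votes > len(voting_config) / 2


# n5 is cut off first, then the remaining four nodes split 2+2.
isolated = {"n5"}
left_half = {"n1", "n2"}
right_half = {"n3", "n4"}

# Check both cases from the post: the cluster "thinks" it has 5 or 4 voters.
config_of_5 = {"n1", "n2", "n3", "n4", "n5"}
config_of_4 = {"n1", "n2", "n3", "n4"}

for config in (config_of_5, config_of_4):
    for part in (isolated, left_half, right_half):
        print(len(config), sorted(part), can_elect_master(part, config))
```

In both cases no partition reaches 3 votes, so none of them can elect a master. The cluster loses availability rather than consistency.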

DavidTurner · September 6, 2023, 11:04am

The rest of the page about voting configurations explains the mechanism that Elasticsearch uses to provide these guarantees. It is indeed tricky.

DavidTurner · September 6, 2023, 11:31am

I saw you opened an issue on GitHub, but the info you are asking for is already covered in these docs:

etki · September 7, 2023, 11:19am

These are somewhat unrelated things (except that both arose from my exploration of master management); the issue is about general cluster safety under normal operation rather than anything else. I'm afraid this link says exactly the same thing as the passage quoted in the issue:

As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time, allowing enough time for the cluster to automatically adjust the voting configuration and adapt the fault tolerance level to the new set of nodes.

What's missing is how exactly the end user can see that the changes have been applied; it's not clear what "enough time" means. I'm not worried about my personal knowledge here. This is the page other people will refer to during an incident, and it would be crucial to have this information right there.
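One way to observe the current voting configuration is the cluster state API, filtered to `metadata.cluster_coordination.last_committed_config`. Below is a minimal sketch of checking whether a node has left it. The helper names and the node IDs in the sample are hypothetical, and the response shape is assumed to be a plain list of node IDs under that path, so verify it against your version.

```python
import json

# Request path for the filtered cluster state (send it with any HTTP client).
VOTING_CONFIG_PATH = (
    "/_cluster/state?filter_path="
    "metadata.cluster_coordination.last_committed_config"
)


def voting_config(cluster_state_json: str) -> set:
    """Extract the set of node IDs in the last committed voting configuration."""
    state = json.loads(cluster_state_json)
    return set(state["metadata"]["cluster_coordination"]["last_committed_config"])


def has_left_voting_config(cluster_state_json: str, node_id: str) -> bool:
    """True once the given node ID no longer appears in the voting configuration."""
    return node_id not in voting_config(cluster_state_json)


# Illustrative response body with made-up node IDs.
sample = (
    '{"metadata": {"cluster_coordination": '
    '{"last_committed_config": ["idA", "idB", "idC"]}}}'
)
print(has_left_voting_config(sample, "idD"))
```

Polling this until the removed node's ID disappears is one reading of "enough time", under the assumptions above.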

DavidTurner · September 7, 2023, 11:31am

It is there, but it is tricky too as you rightly pointed out. If you don't wait long enough then you're effectively removing multiple nodes at the same time, so you must use the voting config exclusions API:

Although the voting configuration exclusions API is most useful for down-scaling a two-node to a one-node cluster, it is also possible to use it to remove multiple master-eligible nodes all at the same time.

Edit to add: put differently, the voting config exclusions API is the correct and expected way to wait for the things you are asking about waiting for.
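As a rough sketch of that workflow: add the nodes to the exclusions list, which waits until they have left the voting configuration or the timeout expires, then shut them down, then clear the list. The helpers below only build the request method and path. Their names are hypothetical, and the `node_names` and `timeout` query parameters should be checked against the docs for your Elasticsearch version.

```python
from urllib.parse import urlencode

EXCLUSIONS_PATH = "/_cluster/voting_config_exclusions"


def add_exclusions_request(node_names, timeout="30s"):
    """Build the request that excludes the named nodes from the voting configuration."""
    query = urlencode(
        {"node_names": ",".join(node_names), "timeout": timeout},
        safe=",",  # keep the comma-separated list readable
    )
    return "POST", f"{EXCLUSIONS_PATH}?{query}"


def clear_exclusions_request():
    """Build the request that clears the exclusions list after the nodes are gone."""
    return "DELETE", EXCLUSIONS_PATH


# Node names here are illustrative only.
print(add_exclusions_request(["node-1", "node-2"], timeout="1m"))
print(clear_exclusions_request())
```

If the first request returns successfully, the excluded nodes are no longer in the voting configuration, so there is no need to guess how long to wait.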
