Split brain problem with two master nodes

We have a requirement to upgrade the OS, and for that Elasticsearch will be freshly installed on the nodes. I have a question related to removing the master nodes.

We have 3 master nodes as part of the cluster. What approach can I take for removing master nodes from the cluster for the OS migration, and then adding them back after the migration is done?

Approach 1 - Remove one master node at a time for OS migration
Questions

  1. Can I remove one master node, following this blog, let the Elasticsearch cluster run with 2 master nodes, and then add the removed node back once the OS migration is done? (The removal sequence I have in mind is sketched after Approach 2 below.)
  2. With 2 master nodes, is there a risk of split brain if they are not able to talk to each other?

Approach 2 - Add a new master node for the OS migration, then remove an existing one afterwards, so that the cluster always has 3 master nodes.
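
For reference, the Approach 1 removal sequence I have in mind looks roughly like this (just a sketch, assuming Elasticsearch 7.8+ and curl against localhost:9200; es-master-1 is a placeholder node name):

```sh
# Exclude the master being migrated from the voting configuration
# before shutting it down:
curl -s -X POST 'http://localhost:9200/_cluster/voting_config_exclusions?node_names=es-master-1&timeout=30s'

# ... stop es-master-1, migrate the OS, start the node again ...

# Clear the exclusion so the node can vote again; wait_for_removal=false
# is needed because the excluded node is back in the cluster:
curl -s -X DELETE 'http://localhost:9200/_cluster/voting_config_exclusions?wait_for_removal=false'
```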

Both approaches are fine. No split-brain risk either way.

There is no risk of split brain, but if they are not able to talk to each other the cluster will be unavailable and you will need to wait for a new master election, which requires a majority of the master-eligible nodes.
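
You can check whether a master is currently elected with something like this (assuming curl against localhost:9200; the call fails with a master_not_discovered_exception when no master can be elected):

```sh
# List the currently elected master node; fails if there is none.
curl -s 'http://localhost:9200/_cat/master?v'
```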

If your nodes are master-only nodes, I would simply add new masters and remove the old ones.
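
For a new dedicated master, the configuration could be something like this (a sketch; all names are made up, node.roles needs 7.9+, and cluster.initial_master_nodes must not be set because the node joins an existing cluster rather than bootstrapping a new one):

```sh
# Minimal settings for a new dedicated master node (hypothetical names).
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
cluster.name: my-cluster
node.name: es-master-4
node.roles: [ master ]
discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]
EOF
```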


@leandrojmp @DavidTurner
Thanks for the details. If both nodes are not able to talk to each other and the cluster is not available, then this approach is not fine, correct?
Or let me rephrase my question:
If there are two master nodes and they are not able to talk to each other, is my cluster at risk? Will the cluster be considered healthy and functional?

Thanks @DavidTurner!
For Option #1, will the cluster be functional and healthy if the two remaining master nodes are not able to talk to each other?

Which node will be the master in this case, when the two nodes are not able to talk to each other?

It is not at risk of something like split brain, but it will not be healthy or functional; it will just stop answering requests.

A two-node cluster is not resilient to failures. If one of the nodes goes down, or the nodes cannot communicate with each other, the cluster will not work until a new master is elected, and electing a master requires a majority of the master-eligible nodes, which is why you need at least 3 of them.

So, in summary, if you remove one node to change the OS and this node is not the active master, your cluster will still be running and answering requests, but it will not be able to afford losing another node.
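
If you want to see what a master election currently requires, you can inspect the committed voting configuration; an election needs a majority of the node IDs listed there (curl against localhost:9200 assumed):

```sh
# Show the committed voting configuration from the cluster state.
curl -s 'http://localhost:9200/_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config&pretty'
```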

From what you described, I think the safest solution is to add new master nodes on the new OS and cycle out the old ones.

For example, add a new master, then cycle out one old node, and repeat until all nodes are replaced.
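
Roughly like this per old node (a sketch, not a tested script; node names, host, and timeouts are assumptions, and the voting exclusions API in this form needs Elasticsearch 7.8+):

```sh
# 1. Start the replacement master on the new OS, then wait for it to join
#    (here: wait until the cluster sees at least 4 nodes).
curl -s 'http://localhost:9200/_cluster/health?wait_for_nodes=%3E%3D4&timeout=60s'

# 2. Exclude the old master from the voting configuration so it can be
#    removed safely.
curl -s -X POST 'http://localhost:9200/_cluster/voting_config_exclusions?node_names=es-master-1&timeout=30s'

# 3. Shut down the old node, then clear the exclusion list once it has
#    left the cluster (the DELETE waits for that by default).
curl -s -X DELETE 'http://localhost:9200/_cluster/voting_config_exclusions'
```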

But even with this approach you should plan a maintenance window, to allow for some downtime.

Thanks @leandrojmp!
That answers my query, as I was looking for which option to choose, and it is Option #2 now, since Option #1 is risky for a cluster running in Prod.