Proper way to force a Master Election after Split Brain occurs


(Markus Gibson) #1

I would like to know the proper way to handle a split-brain situation. I am constantly seeing "failed to send join request to master" messages as nodes attempt to connect to a master that is no longer there. The cluster is dead in the water and does not respond to any requests.


(Mark Walkom) #2

Do you not have minimum_master_nodes set?
Have you restarted the existing/old master?
What does _cat/master show for each node in the cluster?
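For reference, `curl -s localhost:9200/_cat/master` returns a single line per request: the elected master's node id, host, IP, and node name. If different nodes report different masters (or none), the cluster is partitioned. A minimal sketch of pulling out the master's node name from that line (the sample line and names below are made up):

```shell
# _cat/master output format: <node-id> <host> <ip> <node-name>
# Hypothetical sample line, as you might get from:
#   curl -s localhost:9200/_cat/master
sample='Ntgn6oxVSLmvmBHAyfVd _internal 10.0.0.1 master-1'

# The fourth column is the name of the node currently elected master.
echo "$sample" | awk '{print $4}'
```

Running this against every node in the cluster and comparing the answers is a quick way to confirm whether they agree on a single master.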


(Niraj Kumar) #3

How many nodes does your cluster have, and how did you determine it is split brain? Do all the nodes work independently without joining the cluster? What mechanism do your nodes use to join the cluster (Zen, or a cloud plugin like cloud-aws)?

--
Niraj


(Markus Gibson) #4

Hi, I have the following setting:

discovery.zen.minimum_master_nodes: 1

The old master in this case is completely gone.


(Markus Gibson) #5
  1. I'm thinking this is split brain, since none of the master-eligible nodes will actually hold an election; they are still looking for the old master.

  2. We have 30 nodes total: 3 master, 2 client, and 25 data nodes.

  3. No nodes work independently

    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: [ List of hosts here ]

We also have the following set

cloud.node.auto_attributes: true

However, we are not in AWS; we use an internal cloud solution similar to VMware.


(Christian Dahlqvist) #6

If you have 3 master eligible nodes, you should have discovery.zen.minimum_master_nodes set to 2, not 1, as described in the Definitive Guide.
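The rule of thumb from the Definitive Guide is a quorum: minimum_master_nodes = (master-eligible nodes / 2) + 1, which for 3 master-eligible nodes gives 2. A quick sanity check of the arithmetic:

```shell
# Quorum rule for pre-7.x Elasticsearch:
#   discovery.zen.minimum_master_nodes = (master_eligible / 2) + 1
master_eligible=3
quorum=$(( master_eligible / 2 + 1 ))
echo "$quorum"
```

With this set to 2, a single isolated master-eligible node can never elect itself, which is what prevents split brain in the first place. With it set to 1, each partition can happily elect its own master.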


(Markus Gibson) #7

Yes, I just found that, thank you. We use Chef to write out the config; now to figure out why Chef suddenly started writing a 1 rather than a 2.


(Markus Gibson) #8

I've changed it to 2; however, it's still looking for an old master node. Is there a cache somewhere I need to clear?


(Niraj Kumar) #9

I believe there is no cache mechanism. You can do a full cluster restart, though I believe that will take a good amount of downtime.
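A full cluster restart might be sketched as below, assuming shell access to each node and that the corrected minimum_master_nodes value is already in elasticsearch.yml on every node. The hostnames and service name are placeholders, not your actual inventory. (The setting can normally also be changed at runtime via the cluster settings API, but that requires an elected master, which this cluster doesn't have.)

```shell
# 1. Stop Elasticsearch everywhere, so no node keeps trying to rejoin
#    the dead master. Hostnames and service name are placeholders.
for host in master-1 master-2 master-3 data-{1..25} client-1 client-2; do
  ssh "$host" 'sudo service elasticsearch stop'
done

# 2. Start the master-eligible nodes first so they can hold an election
#    (with minimum_master_nodes: 2, two of them form a quorum) ...
for host in master-1 master-2 master-3; do
  ssh "$host" 'sudo service elasticsearch start'
done

# 3. ... then bring back the data and client nodes, which will join the
#    newly elected master.
for host in data-{1..25} client-1 client-2; do
  ssh "$host" 'sudo service elasticsearch start'
done
```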

