Split-brain situation - forcing discovery and rejoin

Andras_Palinkas · August 28, 2013, 12:00am

Hello ES-Team,

I have the following situation. I'm running 6 ES nodes with unicast discovery
(unfortunately I cannot use multicast), and the discovery pings and timeouts
are left at their defaults for now (3 pings, 30-second timeouts).
Yesterday we had a network issue lasting more than 3 minutes, which caused
the ES cluster to fall into 4 pieces (my config required only one
master-eligible node, which is also unfortunate, whatever...).
In the logs I saw that the ES nodes decided to split the cluster after the
3rd ping.

The problem is that once the network came back up, the ES nodes did not try
to rejoin.
How should this work? Is there an option I could set so that, if the
cluster is not complete (not every node is connected), the nodes retry every
X seconds to rejoin the whole cluster (use case: unicast, each node knows
about all of the cluster nodes)?

If there isn't an option like that, is there any way (for example, calling
an admin REST URL) to force the cluster to rejoin?
I know this could be done by restarting each node (running ES
under a process manager and calling the _shutdown REST URL for
each node), but I don't really want to restart the nodes unless I have to.
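
Before forcing anything, it helps to confirm which nodes actually have an incomplete view of the cluster. The following is a minimal sketch, not a rejoin mechanism: it assumes each node answers the cluster health API over HTTP and that the `number_of_nodes` field reflects the nodes that node currently sees. The node names and counts in the usage lines are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_node_count(base_url: str, timeout: float = 5.0) -> int:
    """Ask a single node how many nodes it currently sees in its cluster.

    base_url is something like "http://host:9200" (hypothetical address).
    """
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)["number_of_nodes"]


def nodes_with_partial_view(counts: dict, expected: int) -> list:
    """Return the nodes whose reported cluster size differs from `expected`."""
    return sorted(node for node, seen in counts.items() if seen != expected)


# Hypothetical counts, as if collected with fetch_node_count() from each node
# of a 6-node cluster that has split into several pieces.
reported = {"es1": 2, "es2": 2, "es3": 1, "es4": 1, "es5": 6, "es6": 6}
print(nodes_with_partial_view(reported, expected=6))
```

Any node listed in the output sees fewer (or more) nodes than expected and is a candidate for a targeted restart, instead of restarting the whole cluster.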

Thanks in advance,
Andras

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

btiernay · August 28, 2013, 1:26am

See

"The discovery.zen.minimum_master_nodes allows to control the minimum
number of master eligible nodes a node should “see” in order to operate
within the cluster. Its recommended to set it to a higher value than 1 when
running more than 2 nodes in the cluster."

The best practice is to set it to at least N/2 + 1, but keep in mind:

http://elasticsearch-users.115913.n3.nabble.com/minimum-master-node-not-working-as-expected-td4037027.html
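
To make the N/2 + 1 rule concrete, here is a small sketch of the arithmetic. It assumes all 6 nodes from the original post are master-eligible, and the partition sizes (2, 2, 1, 1) are hypothetical, since the post only says the cluster fell into 4 pieces. The function names are made up for illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest majority of master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(partition_size: int, master_eligible: int) -> bool:
    """True if a partition sees enough master-eligible nodes to elect a master."""
    return partition_size >= minimum_master_nodes(master_eligible)


# Value to use for discovery.zen.minimum_master_nodes with 6 eligible nodes.
quorum = minimum_master_nodes(6)
print(quorum)

# Hypothetical split of the 6 nodes into 4 pieces.
print([can_elect_master(size, 6) for size in (2, 2, 1, 1)])
```

With a quorum of 4, none of the hypothetical pieces can elect its own master, so no competing clusters form. The trade-off is that no piece stays operational until enough nodes can see each other again.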


Runar_Myklebust · August 28, 2013, 6:54am

This behavior is a real problem in certain scenarios. For a lot of our
customers, requiring N/2+1 nodes to be running for the system to be
available is not an option; they want the system to be up and running with,
e.g., 1 of 14 nodes. I would really like to see a future option that forces
nodes to try rejoining the cluster if any other nodes are found in, e.g., the
unicast list.


--
mvh

Runar Myklebust
