I have the following situation. I'm running 6 ES nodes with unicast
discovery (unfortunately I cannot use multicast), with the discovery pings
and timeouts left at their defaults for now (3 pings, 30-second timeouts).
Yesterday we had a network issue that lasted more than 3 minutes, and it
caused the ES cluster to fall into 4 pieces (my config required only one
master-eligible node, which is also unfortunate, but whatever...).
In the logs I saw that the ES nodes decided to split the cluster after the
3rd ping.
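As a rough sanity check on the timing, here is a minimal sketch of how long fault detection might wait before dropping a node. It assumes the 3 retries and 30-second timeout mentioned above, plus a 1-second ping interval, which is an assumption on my part. The function name is made up for illustration, and the formula is only an approximate upper bound, not ES's actual algorithm.

```python
def worst_case_detection_seconds(ping_retries, ping_timeout_s, ping_interval_s=1.0):
    """Approximate upper bound on the time before a silent node is dropped.

    Assumes each retry waits for the full timeout and that pings are
    spaced ping_interval_s apart (both are simplifying assumptions).
    """
    return ping_retries * (ping_timeout_s + ping_interval_s)


outage_s = 3 * 60  # the network issue lasted more than 3 minutes
detect_s = worst_case_detection_seconds(ping_retries=3, ping_timeout_s=30)

# With these numbers the nodes give up well before the outage ends.
print(f"give up after ~{detect_s:.0f}s, outage lasted {outage_s}s")
```

Under these assumptions the nodes stop waiting after roughly a minute and a half, so an outage of more than 3 minutes is long enough for every node to drop its peers.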
The problem is that once the network came back up, the ES nodes did not try
to rejoin.
How should this work? Is there any option I could set so that, if the
cluster is not complete (not every node is connected), the nodes retry every
X seconds to rejoin the whole cluster (use case: unicast, each node knows
about all of the cluster nodes)?
If there is no such option, is there any way (for example, calling
an admin REST URL) to force the nodes to rejoin the cluster?
I know this could be done by restarting each node (running ES
under a process manager and calling the _shutdown REST URL on
each node), but I don't really want to restart the nodes unless I have to.
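To at least detect the split without reading logs, something like the following sketch might help. It asks each node which master it currently sees and groups the nodes by the answer. The host names are hypothetical placeholders, and it assumes the cluster state response exposes a top-level `master_node` field; the helper names are invented for this example.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Hypothetical addresses; replace with the hosts from your unicast list.
NODES = ["es1:9200", "es2:9200", "es3:9200", "es4:9200", "es5:9200", "es6:9200"]


def fetch_master(host, timeout=5):
    """Return the master node id that `host` reports, or None if unreachable."""
    try:
        with urlopen(f"http://{host}/_cluster/state", timeout=timeout) as resp:
            return json.load(resp).get("master_node")
    except (OSError, ValueError):
        return None


def group_by_master(views):
    """Group hosts by the master they report; each group is one partition."""
    groups = defaultdict(list)
    for host, master in views.items():
        groups[master].append(host)
    return dict(groups)


def is_split(views):
    """True if the reachable nodes disagree about who the master is."""
    masters = {m for m in views.values() if m is not None}
    return len(masters) > 1


# Offline example with made-up answers (no network access needed).
sample = {"es1:9200": "A", "es2:9200": "A", "es3:9200": "B", "es4:9200": None}
print(group_by_master(sample), is_split(sample))
```

Against a live cluster the views would be collected with `views = {h: fetch_master(h) for h in NODES}`. This only reports the split; it does not make the nodes rejoin.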
"The discovery.zen.minimum_master_nodes allows to control the minimum
number of master eligible nodes a node should “see” in order to operate
within the cluster. It's recommended to set it to a higher value than 1 when
running more than 2 nodes in the cluster."
The best practice is to set it to at least N/2 + 1, but keep in mind:
On Tuesday, 27 August 2013 20:00:10 UTC-4, András Pálinkás wrote:
Hello ES-Team,
I have the following situation. I'm running 6 ES nodes with unicast
discovery (unfortunately I cannot use multicast), with the discovery pings
and timeouts left at their defaults for now (3 pings, 30-second timeouts).
Yesterday we had a network issue that lasted more than 3 minutes, and it
caused the ES cluster to fall into 4 pieces (my config required only one
master-eligible node, which is also unfortunate, but whatever...).
In the logs I saw that the ES nodes decided to split the cluster after the
3rd ping.
The problem is that once the network came back up, the ES nodes did not try
to rejoin.
How should this work? Is there any option I could set so that, if the
cluster is not complete (not every node is connected), the nodes retry every
X seconds to rejoin the whole cluster (use case: unicast, each node knows
about all of the cluster nodes)?
If there is no such option, is there any way (for example, calling
an admin REST URL) to force the nodes to rejoin the cluster?
I know this could be done by restarting each node (running ES
under a process manager and calling the _shutdown REST URL on
each node), but I don't really want to restart the nodes unless I have to.
This behavior is really a problem in certain scenarios. For a lot of our
customers, requiring N/2+1 nodes to be running for the system to be
available is not an option; they want the system to be up and running with,
e.g., 1 of 14 nodes. I would really like to see a future option that forces
nodes to try rejoining the cluster if any other nodes from, e.g., the
unicast list are found.
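To make the N/2 + 1 rule discussed in this thread concrete, here is a small sketch of the arithmetic. It assumes every node is master-eligible and uses integer division for N/2; the function names are invented for illustration and are not part of any ES API.

```python
def minimum_master_nodes(master_eligible):
    """Quorum size under the N/2 + 1 rule (integer division)."""
    return master_eligible // 2 + 1


def can_elect_master(partition_size, master_eligible):
    """True if a partition of this size still sees enough master-eligible nodes."""
    return partition_size >= minimum_master_nodes(master_eligible)


for n in (6, 14):
    print(f"{n} nodes -> minimum_master_nodes = {minimum_master_nodes(n)}")

# A 6-node cluster split into two halves of 3: neither half has a quorum.
print(can_elect_master(3, 6))
# A single surviving node out of 14 cannot operate under this rule.
print(can_elect_master(1, 14))
```

This shows the trade-off raised above: the rule prevents several partitions from each electing their own master, but it also means a small surviving group of nodes will not operate on its own.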