Network outage keeps split brain status (no recovery by ES) (was issue #5144)

Hi,

Recently we discovered that Elasticsearch is not able to resolve a previous
split-brain situation in an existing cluster. The problem (split brain and
its subsequent resolution) can be split into two main parts:

  1. Reorganization of the whole cluster and logging
  2. Resolution of data conflicts

The first part should be fairly "easy" to solve: discovery should take
place regularly and update the cluster organization if necessary.

The second part would be more complex and depends on what users are doing.
In our application it is not that important that conflicts caused by a
split brain are resolved by Elasticsearch - we can easily handle that
ourselves (re-import the data modified during the split-brain situation).

IMHO it is much better to let ES resolve the split brain than to let it run
"forever" in the split-brain situation.

From the original issue
https://github.com/elasticsearch/elasticsearch/issues/5144 :


We have a 4-node ES cluster running ("plain" Zen discovery - no cloud
stuff). Two nodes are in one DC, two nodes in another DC.

When the network connection between both DCs fails, ES forms two two-node
ES clusters - a split brain. When the network is operative again, the
split-brain situation remains persistent.

I've set up a small local test with a 4-node ES cluster:

+--------+                  +--------+
| Node A |----\        /----| Node C |
+--------+     \      /     +--------+
                ......        (inter-DC link)
+--------+     /      \     +--------+
| Node B |----/        \----| Node D |
+--------+                  +--------+
            Single ES cluster

When the network connection fails, two two-node clusters exist (a split
brain). I've simulated that with "iptables -A INPUT/OUTPUT -s/-d ... -j
DROP" statements.
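Spelled out, the rules looked roughly like this (the subnet is a
placeholder for the other DC's addresses):

  # on nodes A and B: drop all traffic to and from the other DC
  iptables -A INPUT  -s 10.0.2.0/24 -j DROP
  iptables -A OUTPUT -d 10.0.2.0/24 -j DROP
  # (and the mirrored rules on nodes C and D)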

+--------+                  +--------+
| Node A |----\        /----| Node C |
+--------+     \      /     +--------+
                X    X        (inter-DC link down)
+--------+     /      \     +--------+
| Node B |----/        \----| Node D |
+--------+                  +--------+
    ES cluster                  ES cluster

When the network between nodes AB and CD is operative again, the single
cluster status is not restored (split brain is persistent).

It did not make a difference whether unicast or multicast Zen discovery was
used.
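For reference, the unicast variant was configured roughly like this in
elasticsearch.yml (host names are placeholders); the multicast variant
simply leaves multicast enabled:

  # elasticsearch.yml (ES 1.x) - unicast Zen discovery
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["node-a", "node-b", "node-c", "node-d"]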

Another issue is that operating system keepalive settings affect the time
after which ES detects a node failure. Keepalive timeout settings (e.g.
net.ipv4.tcp_keepalive_time/probes/intvl) directly influence node failure
detection.

There should be some task that regularly polls the "alive" status of all
other known nodes.
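These are the knobs involved, as far as I can tell - the OS-level TCP
keepalives and the Zen fault-detection pings (the values shown are the
usual defaults, not recommendations):

  # Linux TCP keepalive settings (sysctl)
  net.ipv4.tcp_keepalive_time = 7200    # seconds of idle before the first probe
  net.ipv4.tcp_keepalive_intvl = 75     # seconds between probes
  net.ipv4.tcp_keepalive_probes = 9     # failed probes before the connection is dropped

  # elasticsearch.yml (ES 1.x) - Zen fault detection pings
  discovery.zen.fd.ping_interval: 1s    # how often nodes and master are pinged
  discovery.zen.fd.ping_timeout: 30s    # how long to wait for a ping response
  discovery.zen.fd.ping_retries: 3      # retries before a node is considered failed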

Tested with ES 1.0.0 (and an older 0.90.3).


David Pilato: "Did you try to set minimum_master_nodes to 3? See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election
"


Me: "Setting minimum_master_nodes to 3 is not an option. If I understand
correctly, it would force all 4 nodes to stop working at all - means: no
service at all. This wouldn't cover the case, that two nodes are taken down
for maintenance work. And what if there a three DCs (each with 2 nodes) - a
setting of minimum_master_nodes=5 would only allow one node to fail before
ES stops working. IMHO there should be a regular job inside ES, that checks
the existence of other nodes (either via unicast or via multicast) and
triggers (re-)discovery if necessary - the split brain situation must be
resolved."


David Pilato: "Exactly. Cluster will stop working until network connection
is up again.
What do you expect? Which part of the cluster should hold the master in
case of network outage?

Cross Data center replication is not supported yet and you should consider:

  • use the great snapshot and restore feature to snapshot from a DC and
    restore in the other one
  • index in both DC (so two distinct clusters) from a client level
  • use Tribe node feature to search or index on multiple clusters

I think we should move this conversation to the mailing list."
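For reference, the snapshot and restore approach mentioned above would look
roughly like this with the ES 1.0 snapshot API (repository name and path
are placeholders):

  # source cluster: register a shared filesystem repository and take a snapshot
  curl -XPUT 'localhost:9200/_snapshot/dc_backup' -d '{
    "type": "fs",
    "settings": { "location": "/mnt/backups/dc_backup" }
  }'
  curl -XPUT 'localhost:9200/_snapshot/dc_backup/snapshot_1?wait_for_completion=true'

  # other DC's cluster: register the same repository, then restore the snapshot
  curl -XPOST 'localhost:9200/_snapshot/dc_backup/snapshot_1/_restore'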


I agree with David.

With a quorum of (n/2)+1 you are safe: the cluster can always elect a
single leader.

For multiple DCs, running a single cluster without such a quorum is high
risk unless you have a reliable network. Why not run several clusters, one
per DC, and sync them over a slow connection, depending on the availability
of the DCs?

ES is already running a regular job for (re-)discovering nodes.

Once a split brain has happened it is too late to resolve it without weird
effects after a rejoin. ES does not mark data operations with a distributed
timestamp protocol, so conflict resolution would have to depend on voting.
Such voting is not stable: with two halves of a cluster, you may never have
a winner, and data operations could be applied in the wrong order.

Jörg


On Tuesday, February 18, 2014 11:24:34 AM UTC+1, Jörg Prante wrote:

> With a quorum of (n/2)+1 you are safe: the cluster can always elect a
> single leader.

Hm... not really. With 3 DCs x 2 nodes --- quorum = 6/2+1 = 4 --- a 4 + 2
node "split brain" is still possible.
And ES stops working if #nodes < minimum_master_nodes.
It would be better to switch ES to a "read-only" mode - maybe by
introducing a new configuration option similar to minimum_master_nodes.

> For multiple DCs, running a single cluster without such a quorum is high
> risk unless you have a reliable network. Why not run several clusters,
> one per DC, and sync them over a slow connection, depending on the
> availability of the DCs?

The connection between the DCs is fast and reliable (and expensive, I
guess). But practice shows that there is no 100% uptime guarantee -
comparable to the reliability of network connections between two racks.

> ES is already running a regular job for (re-)discovering nodes.

Again, it would be a benefit if ES were able to detect a split-brain
situation - better than keeping the cluster(s) running as if nothing had
happened.

> Once a split brain has happened it is too late to resolve it without
> weird effects after a rejoin. ES does not mark data operations with a
> distributed timestamp protocol, so conflict resolution would have to
> depend on voting. Such voting is not stable: with two halves of a
> cluster, you may never have a winner, and data operations could be
> applied in the wrong order.

Agreed.


Robert,
I think the "(n/2)+1" formula should always work, and work reliably if
applied to
discovery.zen.minimum_master_nodes

It's absolutely impossible to establish a quorum and elect a new leader
without >50% of the total nodes. When and if a quorum is established, the
conditions to elect a new leader among the remaining nodes can never exist
(because they would be less than 50%).
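As a worked example of the formula (integer division):

  4 nodes (2 per DC)         ->  minimum_master_nodes = (4/2)+1 = 3
  6 nodes (3 DCs x 2 nodes)  ->  minimum_master_nodes = (6/2)+1 = 4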

Yes, the consequence is that the cluster becomes inoperable, but that is
the current result when a quorum is not achieved. Whether the cluster
should become read-only instead of inoperable is an interesting idea.

IMO,
Tony


You can have obscure network topology issues where each (master-eligible)
node does not see all other (master-eligible) nodes; in this case there may
be overlaps. So it is possible to have a split brain even if an (n/2)+1
quorum is configured. The risk is very low, but the probability of this
insane situation rises with the number of master-eligible nodes.

Because of this insanity, there is good reason why it is preferable to keep
just 3 small, data-less nodes as the only master-eligible nodes, with
minimum_master_nodes set to 2. Keeping an eye on these 3 nodes should be
manageable. To these 3 nodes you can add as many data nodes as you want;
each of them only needs to be visible to at least one master node - that is
enough to join.
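A minimal sketch of that layout with ES 1.x settings (host names are
placeholders):

  # elasticsearch.yml on each of the 3 small master-only nodes
  node.master: true
  node.data: false
  discovery.zen.minimum_master_nodes: 2

  # elasticsearch.yml on every data node
  node.master: false
  node.data: true
  discovery.zen.ping.unicast.hosts: ["master-1", "master-2", "master-3"]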

The cluster I operate spans two racks; it hasn't split since mid-2010. Yes,
I have tested network disruptions to cause split brains.

Jörg


So,
it sounds like your very reasonable topology - only 3 master-only nodes
plus some larger number of data nodes - is a good way to manage where the
master node(s) are. The only drawback I might see is that you're relying on
only 3 nodes, although that number can be increased if desired for fault
tolerance.

Compared to making a far larger number of nodes (maybe even every node in
the cluster) master-eligible, I assume that approach should still be a
reasonable configuration, and if a split brain is still a concern, the
requirement can be increased to any number more than (n/2)+1. But of course
the flip side will always be that higher quorum requirements increase the
possibility of downtime.

I'm just trying to imagine a scenario - even an outright weird one - where
some minority could believe it was a majority in the cluster. It almost
seems that some kind of caching or other persistent storage of node state
would have to be involved, or maybe so many nodes exist in the cluster that
it takes a very long time to do the count, and during that window a
master-eligible node could go down?

Just theorizing,
Tony


The idea of having three "node groups" (each with one master + N data
nodes) sounds good. In that situation the quorum will work fine.
Combined with a "read-only" instead of "inoperable" behaviour it would be
great.

It is not ideal but I think I can convince the ops team to add a third
group to the cluster.

The other solution would be to bloat ES with functionality that Cassandra
already has (DC replication etc.)... I think that's off topic. :wink:
