Split brain due to 'on the fence' network partition

Mark_Tinsley · November 20, 2013, 9:52am

Hi all,

I have been seeing some strange behaviour using Elasticsearch on AWS.

The setup is three nodes, each with the following settings:
cluster.name: <clustername>
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled : false
discovery.type : ec2
discovery.ec2.ping_timeout : 30s
discovery.ec2.groups: <group>
cloud.aws.region : <region>
action.disable_delete_all_indices : true
discovery.zen.minimum_master_nodes : 2

I have witnessed two occurrences of the following:
Given three nodes A, B and C, all in the same availability zone.

To start with, all nodes are connected in the cluster and A is the master.
For some reason, nodes A and B can no longer talk to each other, but both
can still talk to C, and C can talk to both A and B, i.e. an 'on the
fence' network partition, as C can still see every node:
A: [2013-11-17 20:23:28,257][INFO ][cluster.service ] [A] removed {[B][sUv4amcFSdmaDAVDa7bUVg][inet[/<ipaddress>:9300]],}, reason: zen-disco-node_failed([B][sUv4amcFSdmaDAVDa7bUVg][inet[/<ipaddress>:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

B: [2013-11-17 20:25:27,543][INFO ][discovery.ec2 ] [B] master_left [[A][O25rauSQR7utohD0jg4RQw][inet[/<ipaddress>:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2013-11-17 20:25:27,547][INFO ][cluster.service ] [B] master {new [B][sUv4amcFSdmaDAVDa7bUVg][inet[/<ipaddress>:9300]], previous [A][O25rauSQR7utohD0jg4RQw][inet[/<ipaddress>:9300]]}, removed {[A][O25rauSQR7utohD0jg4RQw][inet[/<ipaddress>:9300]],}, reason: zen-disco-master_failed ([A][O25rauSQR7utohD0jg4RQw][inet[/<ipaddress>:9300]])

C: [2013-11-17 20:23:28,256][INFO ][cluster.service ] [C] removed {[B][sUv4amcFSdmaDAVDa7bUVg][inet[/<ipaddress>:9300]],}, reason: zen-disco-receive(from master [[A][O25rauSQR7utohD0jg4RQw][inet[/<ipaddress>:9300]]])

As you can see, B is now a new master, but A has not stepped down as
master: A can still see C, so it still satisfies the
minimum_master_nodes criterion of 2.
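
To make the reasoning above concrete, here is a minimal sketch in Python. It is a toy model and not Elasticsearch's actual election code: it assumes a node may hold or claim mastership whenever the number of nodes it can reach (itself included) is at least minimum_master_nodes. The function names are made up for illustration.

```python
def visible(node, nodes, broken_links):
    """Return the set of nodes that `node` can reach, itself included."""
    return {n for n in nodes
            if n == node or frozenset((node, n)) not in broken_links}


def can_be_master(node, nodes, broken_links, minimum_master_nodes):
    """Toy rule (an assumption): enough visible nodes means the node may be master."""
    return len(visible(node, nodes, broken_links)) >= minimum_master_nodes


nodes = ["A", "B", "C"]
broken = {frozenset(("A", "B"))}  # only the A<->B link is down

for n in ("A", "B"):
    print(n, sorted(visible(n, nodes, broken)),
          can_be_master(n, nodes, broken, minimum_master_nodes=2))
```

In this model both A and B see two nodes, so both pass the check with a value of 2, while neither would pass with a value of 3.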

When I ask B for its state, it responds that it is the master of a cluster with C.

When I ask A for its state, it responds that it is the master of a cluster with C.

When I ask C for its state, it responds with the same cluster state as A.
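
The check described above (asking each node who it thinks the master is) could be scripted roughly as follows. This is a sketch under assumptions: it assumes each node's HTTP API is reachable on port 9200 and that the `/_cluster/state` response carries a `master_node` field holding the master's node id. The helper names are hypothetical, and the sample views are the ones reported in this post, using the node ids from the logs.

```python
import json
from urllib.request import urlopen


def fetch_master(host, port=9200, timeout=5):
    """Ask one node for its cluster state and return the master's node id.

    Assumes the HTTP API listens on `port` and the response has `master_node`.
    """
    with urlopen(f"http://{host}:{port}/_cluster/state", timeout=timeout) as resp:
        return json.load(resp)["master_node"]


def masters_by_view(views):
    """Group node names by the master id that each node reports."""
    groups = {}
    for node, master in views.items():
        groups.setdefault(master, []).append(node)
    return groups


def is_split(views):
    """More than one distinct master id across the nodes means a split brain."""
    return len(masters_by_view(views)) > 1


# Views as described in this post: A and C report A, B reports itself.
views = {
    "A": "O25rauSQR7utohD0jg4RQw",
    "B": "sUv4amcFSdmaDAVDa7bUVg",
    "C": "O25rauSQR7utohD0jg4RQw",
}
print(masters_by_view(views))
print(is_split(views))
```

Against a live cluster, `views` would be built with something like `{name: fetch_master(host) for name, host in hosts.items()}`, where `hosts` is your own mapping of node names to addresses.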

This can be reproduced by setting up three nodes (settings above), then,
once a master has been established, dropping the connection between it and
the node you expect to become the next master (usually the next node in the
list after the master). I used the following commands:

On the master node (A): iptables -A INPUT -s <node B ip address> -j DROP

On the next node (B): iptables -A INPUT -s <node A ip address> -j DROP

This should get you into the same state that I have witnessed on AWS. Once
two masters are established, remove the iptables entries (by running
iptables -F on A and B). From what I understand, node discovery only happens
when a node is starting up or does not belong to a cluster, so as these
nodes do belong to a cluster, they never rediscover each other.
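
For anyone scripting the reproduction, the sketch below builds the two iptables invocations used above, plus a matching `iptables -D` call that deletes only the rule that was added (`iptables -F` flushes every rule in the table's chains, which may be more than you want on a shared host). The function names and the 10.0.0.x addresses are placeholders of mine, not values from the cluster described here; the commands are only printed, not executed.

```python
def block_rule(peer_ip):
    """argv that drops all inbound traffic from peer_ip (same rule as above)."""
    return ["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"]


def unblock_rule(peer_ip):
    """argv that deletes exactly the rule added by block_rule(peer_ip)."""
    return ["iptables", "-D", "INPUT", "-s", peer_ip, "-j", "DROP"]


# Hypothetical addresses; substitute the real IPs of nodes A and B.
node_a_ip = "10.0.0.1"
node_b_ip = "10.0.0.2"

print("on A:", " ".join(block_rule(node_b_ip)))
print("on B:", " ".join(block_rule(node_a_ip)))
print("undo on A:", " ".join(unblock_rule(node_b_ip)))
print("undo on B:", " ".join(unblock_rule(node_a_ip)))
```

Each list can be passed to `subprocess.run(..., check=True)` as root on the relevant node if you want the script to apply the rules itself.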

I have tried this against Elasticsearch versions 0.90.0, 0.90.4, 0.90.7 and
1.0.0.Beta1, and the split brain occurs on all of them. I was using the
elasticsearch-cloud-aws plugin version 1.11.0 for Elasticsearch version
0.90.0 and version 1.15.0 for Elasticsearch versions 0.90.4, 0.90.7 and
1.0.0.Beta1.

I do not want to have to set minimum_master_nodes to 3, as for this use
case I value availability.

Any help would be greatly appreciated.

Kind Regards,

Mark Tinsley

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · November 20, 2013, 3:36pm

I think you should open an issue in the elasticsearch project with that excellent description you wrote.
I don't know how it could be fixed, though.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Leonardo_Menezes · November 20, 2013, 3:50pm

This issue is already reported here:

https://github.com/elasticsearch/elasticsearch/issues/2488

minimum_master_nodes does not prevent split-brain if splits are intersecting

opened 08:15AM - 17 Dec 12 UTC

closed 02:58PM - 01 Sep 14 UTC

saj

>bug v2.0.0-beta1 v1.4.0.Beta1

G'day, I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol. With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now `minimum_master_nodes` is incapable of preventing certain split-brain scenarios.

Here's what my 3-node test cluster looked like before I broke it: ![](https://saj.beta.anchortrove.com/es-splitbrain-1.png)

Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3): ![](https://saj.beta.anchortrove.com/es-splitbrain-2.png)

Here's what seems to have happened immediately after the split:

1. Node (2) and (3) lose contact with one another. (`zen-disco-node_failed` ... `reason failed to ping`)
2. Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory.
3. Node (3) has now lost contact with its old master and decides to hold an election. It declares itself winner of the election. On declaring itself, it assumes master role of the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too.

At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.

Let's look at `minimum_master_nodes` as it applies to this test cluster. Assume I had set `minimum_master_nodes` to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for `minimum_master_nodes`).

The problem with `minimum_master_nodes` is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with `minimum_master_nodes` set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split.

Is there anything that can be done to detect the intersecting split on node (1)? Would #1057 help? Am I missing something obvious? :)
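
The 7-node case from the issue text can be worked through with a small Python sketch. It is a counting argument only, not Elasticsearch code, and the node numbers 6 and 7 are an arbitrary choice of mine for the 'right' two nodes. It shows that both sides still see more than `minimum_master_nodes` nodes after a single link fails, and that whenever `minimum_master_nodes` is a majority, any two such views must share at least one node (the node 'on the fence').

```python
def min_shared_nodes(total_nodes, minimum_master_nodes):
    """Fewest nodes two views must share if each contains at least
    minimum_master_nodes of total_nodes (pigeonhole bound)."""
    return max(0, 2 * minimum_master_nodes - total_nodes)


def views_after_link_failure(total_nodes, a, b):
    """Node sets seen by a and by b once only the a<->b link is lost."""
    everyone = set(range(1, total_nodes + 1))
    return everyone - {b}, everyone - {a}


view_6, view_7 = views_after_link_failure(7, 6, 7)
print(len(view_6), len(view_7))   # size of each side's view
print(sorted(view_6 & view_7))    # nodes that appear in both views
print(min_shared_nodes(7, 4), min_shared_nodes(3, 2))
```

The shared nodes are the ones that receive advisories from two different masters, which is where an intersecting split could in principle be noticed.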

No solution yet, though.

dadoonet · November 20, 2013, 3:55pm

Ha thanks Leonardo!

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Mark_Tinsley · November 20, 2013, 4:49pm

Thanks for the replies. I'll take a look at the elasticsearch-zookeeper solution.

Cheers,
