minimum_master_nodes 0.19.8 vs 0.19.7 - recovery after multiple network disconnects

Hi All,

I've been doing some testing of a 3 node cluster with minimum_master_nodes
set to 2.

The summary outcome is:

Versions 0.19.7 and below recover from multiple network disconnects (without
rebooting nodes) and honor the minimum_master_nodes setting, avoiding a
split brain. Version 0.19.8 recovers from the first network disconnect but
fails to recover from the second: the disconnected node elects itself as
master and a split brain occurs.

Test setup:

3 nodes with the following config:

cluster.name: splitbrain
node.name: node[1,2,3]

discovery:
  zen:
    minimum_master_nodes: 2

Each node has a different node.name.
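
Something like the following (just a sketch; the hostnames/ports are placeholders
for each node's HTTP address) can confirm that all three nodes have joined
before starting a pass:

import json
import urllib.request

# Placeholder HTTP addresses -- substitute the real host:port of each node.
NODES = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]

for base in NODES:
    try:
        # /_cluster/health reports the cluster status and how many nodes
        # this particular node can see.
        with urllib.request.urlopen(base + "/_cluster/health", timeout=5) as resp:
            health = json.loads(resp.read())
        print(base, health["status"], "nodes seen:", health["number_of_nodes"])
    except Exception as exc:
        print(base, "unreachable:", exc)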

Test steps:

  1. Start node1, node2, and node3
  2. Create an index called "test1" with default shards/replicas (5/1) - see
    the sketch after this list
  3. Yank the network cable from node3
  4. After 30 seconds, node3 gets a ping failure
  5. Another 30 seconds, node3 gets another ping failure
  6. Another 30 seconds, node3 reports that there are not enough nodes "[WARN
    ][discovery.zen ] [node3] not enough master nodes after master
    left (reason = transport disconnected (with verified connect)), current
    nodes: {[node3][PKXJFl57R82PNhhD0p-n1Q][inet[/192.168.7.22:9300]],}"
  7. Node 3 then goes into a loop pinging the other nodes
  8. Reconnect network cable
  9. Node 3 rejoins cleanly
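
For reference, steps 2 and 4-6 look roughly like this over the HTTP API (a
sketch only; hostnames/ports are placeholders, and the polling loop is meant to
run on node3 itself, since node3's cable gets pulled):

import json
import time
import urllib.request

NODE1 = "http://node1:9200"      # placeholder: any node that stays connected
NODE3 = "http://localhost:9200"  # run the loop on node3 itself

# Step 2: create "test1" with default settings (5 shards / 1 replica).
req = urllib.request.Request(NODE1 + "/test1", data=b"", method="PUT")
print(json.loads(urllib.request.urlopen(req, timeout=5).read()))

# Steps 4-6: after the cable is yanked, watch node3's view of the cluster.
# With minimum_master_nodes: 2, node3 should end up refusing to act as master,
# at which point the health call may simply start erroring out. Ctrl-C to stop.
while True:
    try:
        with urllib.request.urlopen(NODE3 + "/_cluster/health", timeout=5) as resp:
            health = json.loads(resp.read())
        print(time.strftime("%H:%M:%S"),
              "node3 sees", health["number_of_nodes"], "node(s), status", health["status"])
    except Exception as exc:
        print(time.strftime("%H:%M:%S"), "health call failed:", exc)
    time.sleep(10)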

With versions 0.19.7 and below, I can repeat steps 2-9 over and over. On each
run through, the disconnected node behaves the same way: it prints the above
message and re-joins the cluster cleanly.

Note that I'm not rebooting/restarting any nodes before repeating steps 2-9.

With version 0.19.8, the first run through of steps 2-9 works as expected.
On the second run through, the disconnected node behaves differently: the
second ping failure (step 5) never gets logged. Instead, the following log
message starts repeating:

[2012-07-24 11:55:00,270][WARN ][cluster.service ] [node3] failed to reconnect to node [node2][aPugNxMpTvCUDvjNfYKFjA][inet[/192.168.7.132:9301]]
org.elasticsearch.transport.ConnectTransportException: [node2][inet[/192.168.7.132:9301]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:563)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:505)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:483)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
    at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:377)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
Caused by: java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.Net.connect(Native Method)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:532)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:139)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:102)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:573)
    at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:642)
    at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:205)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:230)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:550)
    ... 7 more

The node then elects itself as master and still reports itself as connected to
one of the other nodes. Upon reconnecting the network, we have a split brain:
node3 (master) thinks it's connected to node2, while node1 (master) and node2
are connected to each other and know nothing about node3 (nor do they log
anything about node3 upon re-connection).
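
To see the split brain from the outside, something like this sketch works
(hostnames are placeholders, and I'm assuming the master_node field of the
/_cluster/state response): ask each node who it thinks the master is and
compare the answers:

import json
import urllib.request

NODES = {"node1": "http://node1:9200",   # placeholder addresses
         "node2": "http://node2:9200",
         "node3": "http://node3:9200"}

def master_seen_by(base):
    # /_cluster/state includes master_node (a node id) plus a nodes map
    # that lets us turn the id back into a node.name.
    with urllib.request.urlopen(base + "/_cluster/state", timeout=5) as resp:
        state = json.loads(resp.read())
    master_id = state.get("master_node")
    return state.get("nodes", {}).get(master_id, {}).get("name", master_id)

answers = {}
for name, base in NODES.items():
    try:
        answers[name] = master_seen_by(base)
    except Exception as exc:
        answers[name] = "unreachable (%s)" % exc

print(answers)
reachable = {m for m in answers.values() if not str(m).startswith("unreachable")}
if len(reachable) > 1:
    print("split brain: nodes disagree about the master")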

It's a long shot, but could the Netty upgrade in 0.19.8 (Upgrade to Netty
3.5.2, closes #2084 · elastic/elasticsearch@5f1b1c6 · GitHub) have caused this
error?

We could try to detect when a network error has occurred and reboot the
nodes, but this feels like a step back, as pre-0.19.8 we were resilient to
multiple network disconnects without rebooting.
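
Concretely, that workaround would be a watchdog along these lines (a naive
sketch; the log path, trigger string, and restart command are assumptions
about a particular install), which is exactly the kind of thing we'd rather
not run:

import subprocess
import time

LOGFILE = "/var/log/elasticsearch/splitbrain.log"  # assumption: depends on install
TRIGGER = "failed to reconnect to node"            # the warning shown above

# Naive watchdog: count occurrences of the reconnect warning and restart the
# local node whenever a new one appears. Sketch only -- not recommended.
seen = 0
while True:
    try:
        with open(LOGFILE) as log:
            count = sum(1 for line in log if TRIGGER in line)
    except OSError:
        count = seen
    if count > seen:
        seen = count
        subprocess.call(["service", "elasticsearch", "restart"])  # assumption: service name
    time.sleep(30)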

Warm Regards,

Owen Butler

Update: after a bunch more testing, it seems this is not related to 0.19.8 vs
0.19.7 as I said in the title.

The error occurs if the nodeId of the disconnected node is ordered in such
a way that the isolated node elects itself as master (and again, only the
second time around!).
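
For anyone trying to reproduce, here is a sketch for checking that ordering
(the hostname is a placeholder; my working assumption, not verified against
the code, is that zen discovery elects the eligible node whose id sorts first):

import json
import urllib.request

BASE = "http://node1:9200"  # placeholder: any node currently in the cluster

# /_cluster/state lists every node id together with its node.name, so the
# sort order of the ids can be inspected while the cluster is still whole.
with urllib.request.urlopen(BASE + "/_cluster/state", timeout=5) as resp:
    state = json.loads(resp.read())

for node_id in sorted(state.get("nodes", {})):
    print(node_id, "->", state["nodes"][node_id].get("name"))

# If the node you intend to isolate is the one whose id sorts first, that is
# the case where it elects itself as master on the second disconnect.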

I've raised this as a bug; full details, including configs etc., are attached
as a gist:

If anyone can also reproduce this using those configs, that would be much
appreciated.

Cheers,

Owen Butler
