minimum_master_nodes 0.19.8 vs 0.19.7 - recovery after multiple network disconnects

Hi All,

I've been doing some testing of a 3 node cluster with minimum_master_nodes
set to 2.

The summary outcome is:

Versions 0.19.7 and below recover from multiple network disconnects (without
rebooting nodes) and honor the minimum_master_nodes setting, avoiding a
split brain. Version 0.19.8 recovers from the first network disconnect but
fails to recover from the second: the disconnected node elects itself as
master and a split brain occurs.

Test setup:

3 nodes with the following config:

cluster.name: splitbrain
node.name: node[1,2,3]

discovery:
  zen:
    minimum_master_nodes: 2

Each node has a different node.name.
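
Something like the following (just a sketch; the hostnames/ports are placeholders
for each node's HTTP address) can confirm that all three nodes have joined
before starting a pass:

import json
import urllib.request

# Placeholder HTTP addresses -- substitute the real host:port of each node.
NODES = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]

for base in NODES:
    try:
        # /_cluster/health reports the cluster status and how many nodes
        # this particular node can see.
        with urllib.request.urlopen(base + "/_cluster/health", timeout=5) as resp:
            health = json.loads(resp.read())
        print(base, health["status"], "nodes seen:", health["number_of_nodes"])
    except Exception as exc:
        print(base, "unreachable:", exc)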

Test steps:

  1. Start node1, node2, and node3
  2. Create an index called "test1" with default shards/replicas (5/1) - see
    the sketch after this list
  3. Yank the network cable from node3
  4. After 30 seconds, node3 gets a ping failure
  5. Another 30 seconds, node3 gets another ping failure
  6. Another 30 seconds, node3 reports that there are not enough nodes "[WARN
    ][discovery.zen ] [node3] not enough master nodes after master
    left (reason = transport disconnected (with verified connect)), current
    nodes: {[node3][PKXJFl57R82PNhhD0p-n1Q][inet[/192.168.7.22:9300]],}"
  7. Node 3 then goes into a loop pinging the other nodes
  8. Reconnect network cable
  9. Node 3 rejoins cleanly
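
For reference, steps 2 and 4-6 look roughly like this over the HTTP API (a
sketch only; hostnames/ports are placeholders, and the polling loop is meant to
run on node3 itself, since node3's cable gets pulled):

import json
import time
import urllib.request

NODE1 = "http://node1:9200"      # placeholder: any node that stays connected
NODE3 = "http://localhost:9200"  # run the loop on node3 itself

# Step 2: create "test1" with default settings (5 shards / 1 replica).
req = urllib.request.Request(NODE1 + "/test1", data=b"", method="PUT")
print(json.loads(urllib.request.urlopen(req, timeout=5).read()))

# Steps 4-6: after the cable is yanked, watch node3's view of the cluster.
# With minimum_master_nodes: 2, node3 should end up refusing to act as master,
# at which point the health call may simply start erroring out. Ctrl-C to stop.
while True:
    try:
        with urllib.request.urlopen(NODE3 + "/_cluster/health", timeout=5) as resp:
            health = json.loads(resp.read())
        print(time.strftime("%H:%M:%S"),
              "node3 sees", health["number_of_nodes"], "node(s), status", health["status"])
    except Exception as exc:
        print(time.strftime("%H:%M:%S"), "health call failed:", exc)
    time.sleep(10)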

With versions 0.19.7 and below, I can repeat steps 2-9 over and over. On each
run through, the disconnected node behaves the same way: it prints the above
message and re-joins the cluster cleanly.

Note that I'm not rebooting/restarting any nodes before repeating steps 2-9.

With version 0.19.8, the first run through of steps 2-9 works as expected.
On the second run through, the disconnected node behaves differently: the
second ping failure (step 5) never gets logged. Instead, the following log
message starts repeating:

[2012-07-24 11:55:00,270][WARN ][cluster.service ] [node3] failed to reconnect to node [node2][aPugNxMpTvCUDvjNfYKFjA][inet[/192.168.7.132:9301]]
org.elasticsearch.transport.ConnectTransportException: [node2][inet[/192.168.7.132:9301]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:563)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:505)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:483)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
    at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:377)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
Caused by: java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.Net.connect(Native Method)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:532)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:139)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:102)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:573)
    at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:642)
    at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:205)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:230)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:550)
    ... 7 more

The node then elects itself as master and still reports itself as connected to
one of the other nodes. Upon reconnecting the network, we have a split brain:
node3 (master) thinks it's connected to node2, while node1 (master) and node2
are connected to each other and know nothing about node3 (nor do they log
anything about node3 upon re-connection).
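
To see the split brain from the outside, something like this sketch works
(hostnames are placeholders, and I'm assuming the master_node field of the
/_cluster/state response): ask each node who it thinks the master is and
compare the answers:

import json
import urllib.request

NODES = {"node1": "http://node1:9200",   # placeholder addresses
         "node2": "http://node2:9200",
         "node3": "http://node3:9200"}

def master_seen_by(base):
    # /_cluster/state includes master_node (a node id) plus a nodes map
    # that lets us turn the id back into a node.name.
    with urllib.request.urlopen(base + "/_cluster/state", timeout=5) as resp:
        state = json.loads(resp.read())
    master_id = state.get("master_node")
    return state.get("nodes", {}).get(master_id, {}).get("name", master_id)

answers = {}
for name, base in NODES.items():
    try:
        answers[name] = master_seen_by(base)
    except Exception as exc:
        answers[name] = "unreachable (%s)" % exc

print(answers)
reachable = {m for m in answers.values() if not str(m).startswith("unreachable")}
if len(reachable) > 1:
    print("split brain: nodes disagree about the master")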

It's a long shot, but could the Netty upgrade in 0.19.8 (Upgrade to Netty
3.5.2, closes #2084 · elastic/elasticsearch@5f1b1c6 · GitHub) have caused this
error?

We could try to detect when a network error has occurred and reboot the
nodes, but this feels like a step back, as pre-0.19.8 we were resilient to
multiple network disconnects without rebooting.
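
Concretely, that workaround would be a watchdog along these lines (a naive
sketch; the log path, trigger string, and restart command are assumptions
about a particular install), which is exactly the kind of thing we'd rather
not run:

import subprocess
import time

LOGFILE = "/var/log/elasticsearch/splitbrain.log"  # assumption: depends on install
TRIGGER = "failed to reconnect to node"            # the warning shown above

# Naive watchdog: count occurrences of the reconnect warning and restart the
# local node whenever a new one appears. Sketch only -- not recommended.
seen = 0
while True:
    try:
        with open(LOGFILE) as log:
            count = sum(1 for line in log if TRIGGER in line)
    except OSError:
        count = seen
    if count > seen:
        seen = count
        subprocess.call(["service", "elasticsearch", "restart"])  # assumption: service name
    time.sleep(30)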

Warm Regards,

Owen Butler

Update: after a bunch more testing, it seems this is not related to 0.19.8 vs
0.19.7 as I said in the title.

The error occurs if the nodeId of the disconnected node is ordered in such
a way that the isolated node elects itself as master (and again, only the
second time around!).
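
For anyone trying to reproduce, here is a sketch for checking that ordering
(the hostname is a placeholder; my working assumption, not verified against
the code, is that zen discovery elects the eligible node whose id sorts first):

import json
import urllib.request

BASE = "http://node1:9200"  # placeholder: any node currently in the cluster

# /_cluster/state lists every node id together with its node.name, so the
# sort order of the ids can be inspected while the cluster is still whole.
with urllib.request.urlopen(BASE + "/_cluster/state", timeout=5) as resp:
    state = json.loads(resp.read())

for node_id in sorted(state.get("nodes", {})):
    print(node_id, "->", state["nodes"][node_id].get("name"))

# If the node you intend to isolate is the one whose id sorts first, that is
# the case where it elects itself as master on the second disconnect.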

I've raised this as a bug; full details, including configs etc., are attached
as a gist:

If anyone can also reproduce this using those configs, that would be much
appreciated.

Cheers,

Owen Butler
