Ping/Zen/minimum_master_nodes and unexpected behaviour

I recently went through a test of how ES 0.19.8 handles split-brain
conditions using a simple 3-node cluster, and the outcome wasn't what I
expected. It is certainly possible my test method is flawed, so help
pointing that out would be appreciated, but maybe this is the expected
behaviour and not what I thought it should be. I'll outline the full steps
of how I set this up, but basically I used iptables (see [2] below) to shut
off communication to/from one node so that no inbound or outbound traffic
to/from the other 2 nodes can occur.

In the case where 1 node is isolated from the other 2, with
discovery.zen.minimum_master_nodes set to 2 (see [1] below for the full
configuration), I find that this single node still elects itself master.
The cluster is in a red state, which is expected because it has lost quite
a few shards with the communication channels down, but it is still electing
itself master; the Cluster State API returns that the local node is the
master. I thought minimum_master_nodes should flag this as a "you can't
elect yourself master here, because there aren't enough nodes" case, which
is what the minimum_master_nodes property is for.
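
(If you want to see this for yourself, querying the isolated node directly
shows it; assuming the default HTTP port of 9200, the master_node field in
the cluster state response is the giveaway:)

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'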

Now, logically that is probably all it can do, but I was hoping that ES
would detect this split-brain state and refuse to act as a master, indeed
refuse to do much at all. Certainly electing itself as the gateway snapshot
node is very unhealthy in this case, which is what happens when it becomes
the master for the Shared FS Gateway that we use.

In this test case, where only network flow between the nodes is blocked,
this local node still responds to search requests with results from the
shards it has. I'm torn about this behaviour: on the one hand it's nice to
get some results, but in this split-brain case all bets are off, right?
The validity of the shard content is now in question. But maybe for many
users any results you can get, even marked with shard search failures, are
useful (the user can be notified that there are problems and that the
search may not contain all/accurate results etc.).
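
(The partial results do at least advertise themselves: the _shards block of
any search response reports total/successful/failed counts, so a client
hitting the isolated node, assuming the default HTTP port, can tell
something is wrong:)

curl -XGET 'http://localhost:9200/_search?pretty=true'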

Another weird data point is that it took about 5 minutes for this cut-off
node to finally give up on 1 of the nodes, but it took more than 20 minutes
before it gave up on the other. Here are the logs from the cut-off node
(furnace.engr.acx; the other 2 nodes are called anvil and app1.yarra). It
was 8:07 when I set up the iptables rules:

[2012-07-19 08:13:20,362][INFO ][cluster.service ] [furnace.engr.acx] removed {[anvil.engr.acx][t949B6eFQdWUyUzokOpGxw][inet[/192.168.7.239:9300]],}, reason: zen-disco-node_failed([anvil.engr.acx][t949B6eFQdWUyUzokOpGxw][inet[/192.168.7.239:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
....
[2012-07-19 08:29:16,569][INFO ][cluster.service ] [furnace.engr.acx] removed {[app1.yarra][zIFK7ADWRv6fHmUa0aX3Jg][inet[/192.168.7.234:9300]],}, reason: zen-disco-node_failed([app1.yarra][zIFK7ADWRv6fHmUa0aX3Jg][inet[/192.168.7.234:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

Looking at the NodesFaultDetection code that is responsible for performing
the regular ping heartbeat, I'm just not sure why the 30-second timeout is
not working properly here... it eventually does, but 3 x 30 seconds of
retries is very much shorter than the 20-odd minutes it took for the 2nd
node to finally be marked as 'gone' from this isolated node.
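
(For reference, the fault-detection knobs that log message refers to live
under discovery.zen.fd; with what I believe are the defaults shown below, a
dead node should in theory be dropped after roughly 3 x 30s:)

discovery:
  zen:
    fd:
      ping_interval: 1s
      ping_timeout: 30s
      ping_retries: 3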

Additionally, and probably the bigger problem here, is that after returning
network conditions to normal (purging the iptables rules), after this
isolated node had given up on the other 2, the cluster did not rejoin
correctly once network flow resumed. I waited 10 minutes, but there was
nothing in the logs showing discovery happening from either half of the
cluster. Since discovery (multicast) only runs on startup/shutdown, this
isn't helping.

Since this node was isolated, though, and the other 2 nodes were removed
locally, how will it ever get to know about the other 2 nodes again? The
NodesFaultDetection socket is broken at this point, and discovery only
happens on startup/shutdown... ?
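
(As an aside, one thing that would at least make the re-ping targets
explicit is switching from multicast to a unicast host list; a sketch of
that variant, with our host names filled in purely for illustration:)

discovery:
  zen:
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: ["anvil.engr.acx:9300", "app1.yarra:9300", "furnace.engr.acx:9300"]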

I was forced to restart ES on the node that was isolated, and then they
immediately rejoined. I was under the impression that the Zen protocol
would continue sending ping heartbeats in a desperate attempt to locate any
other node out there, so when network conditions recover, I don't quite
understand why the nodes don't just magically rejoin properly.

In particular I was expecting to see
https://github.com/elasticsearch/elasticsearch/issues/2042 come into play
here, with the 2 masters deciding who should win, but because the discovery
wasn't happening, that didn't occur.

I know that a 'normal' split brain is the result of something like a
network cable getting yanked. However, another 'likely' case is a
misconfigured firewall or something similar (which is what the iptables
block rules sort of simulate).
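
Worth noting for anyone repeating this: -j DROP silently discards packets,
so established connections simply hang until TCP itself gives up, which
probably contributes to the long delays above. A REJECT-based rule fails
fast and may behave quite differently, e.g.:

iptables -A INPUT -s 192.168.7.234 -p tcp -j REJECT --reject-with tcp-reset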

Is my test screwed or misguided here? Are my expectations incorrect?

cheers,

Paul Smith

[1] Sample elasticsearch.yml configuration

All 3 nodes have a very basic config, identical except for the node name
obviously:

cluster.name: engr
node.name: furnace.engr.acx
path.data: /aconex/elasticsearch-data

discovery:
  zen:
    minimum_master_nodes=2

[2] iptables config, done on the cut-off node furnace.engr.acx; these are
the IP addresses of the other 2 nodes, used to drop packets going in/out:
iptables -A INPUT -s 192.168.7.234 -j DROP
iptables -A INPUT -s 192.168.7.239 -j DROP
iptables -A OUTPUT -d 192.168.7.234 -j DROP
iptables -A OUTPUT -d 192.168.7.239 -j DROP
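
To purge them afterwards, deleting the same rules (or just flushing with
iptables -F) is enough:

iptables -D INPUT -s 192.168.7.234 -j DROP
iptables -D INPUT -s 192.168.7.239 -j DROP
iptables -D OUTPUT -d 192.168.7.234 -j DROP
iptables -D OUTPUT -d 192.168.7.239 -j DROP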

It should be:

discovery:
  zen:
    minimum_master_nodes: 2

On 21 July 2012 05:27, Igor Motov imotov@gmail.com wrote:

It should be:

discovery:
  zen:
    minimum_master_nodes: 2

Paul looks for a very large hole to hide in

Well, that's um... embarrassing. :slight_smile:

Now when I apply this I get MUCH better, expected outcomes. One node's
pings break within 90 seconds of applying the firewall rules (3 x 30-second
ping retries). The 2nd node does something similar, however it sits there
without properly disconnecting for 18 minutes before giving up in a similar
way. At that point it says there are not enough master nodes and refuses to
serve search requests or cluster state, which is great.

I think the oddness of the 2nd node and the 18 minutes could be related to
that host/TCP stuff. I will be trying this on a different cluster, in a
more production-like setup, to see if both nodes 'disappear' from the
cut-off node's point of view in a similar time frame.
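
(If it is the TCP layer, one suspect would be the kernel's retransmission
limit for established connections; net.ipv4.tcp_retries2 defaults to 15,
which works out to somewhere in the 15-30 minute range before a hung
connection is declared dead, which is in the right ballpark for that 18
minutes:)

sysctl net.ipv4.tcp_retries2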

Thank Igor for catching this face-palm error.. :slight_smile:

cheers,

Paul

Paul looks for a very large hole to hide in

:smiley:

Thank Igor for catching this face-palm error.. :slight_smile:

It took me a good two minutes of staring at both versions to spot the
difference