I recently tested how ES 0.19.8 handles split-brain conditions using a simple
3-node cluster, and the outcome wasn't what I expected. It's certainly possible
my test method is flawed, so any help pointing that out would be appreciated,
but maybe this is the expected behaviour and just not what I thought it should
be. I'll outline the full steps of how I set this up, but basically I used
iptables (see [2] below) to cut one node off so that no inbound or outbound
traffic to/from the other 2 nodes could occur.
In the case where 1 node is isolated from the other 2, with
discovery.zen.minimum_master_nodes set to 2 (see [1] below for the full
configuration), I find that this single node still elects itself master. The
cluster is in a red state, which is expected because it has lost quite a few
shards with the communication channels down, but the node still elects itself
master: the Cluster State API reports that the local node is the master. I
thought minimum_master_nodes should flag this as a "you can't elect yourself
master here, because there aren't enough nodes" case, which is exactly what the
minimum_master_nodes property is for.
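For reference, this is easy to see by querying the isolated node directly; 9200
is just the default HTTP port, so adjust if yours differs:

# Ask the cut-off node who it thinks the master is, and what state it is in.
curl -s 'http://furnace.engr.acx:9200/_cluster/state?pretty=true'
# -> the "master_node" field is furnace's own node id

curl -s 'http://furnace.engr.acx:9200/_cluster/health?pretty=true'
# -> "status" : "red", yet a master is still reported as elected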
Now, logically that's probably all it can do, but I was hoping that ES would
detect this split-brain state and refuse to act as a master, indeed refuse to
do much at all. Certainly electing itself as the gateway snapshot node is very
unhealthy in this case, which is what happens when it becomes master with the
shared FS gateway that we use.
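For context, by shared FS gateway I mean the filesystem gateway pointed at a
shared mount, configured roughly as below. The config path and mount point are
illustrative, not our real ones:

# Rough sketch of a shared FS gateway configuration.
cat >> config/elasticsearch.yml <<'EOF'
gateway.type: fs
gateway.fs.location: /mnt/shared-es-gateway   # shared mount visible to all nodes
EOF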
In this test case, where only network flow between the nodes is blocked, the
local node still responds to search requests with results from the shards it
has. I'm torn about this behaviour: on the one hand it's nice to get some
results, but in a split-brain case all bets are off, right? The validity of the
shard content is now in question. Then again, maybe for many users any results
you can get, even marked with shard search failures, are useful (the user can
be notified that there are problems and that the search may not contain
complete/accurate results, etc.).
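To be concrete, a plain search against the isolated node still returns hits
from its local shards, and the _shards section of the response carries the
failures. The index name below is just an example:

# A match_all search against the cut-off node; the index name is illustrative.
curl -s 'http://furnace.engr.acx:9200/myindex/_search?pretty=true' -d '{
  "query": { "match_all": {} }
}'
# hits come back from the shards this node holds, while the "_shards" block
# reports the failures for the shards it no longer has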
Another weird data point is that it took about 5 minutes for this cut-off node
to finally give up on 1 of the other nodes, but then roughly another 15 minutes
before it gave up on the second. Here are the logs from the cut-off node
(furnace.engr.acx; the other 2 nodes are called anvil and app1.yarra). It was
8:07 when I set up the iptables rules:
[2012-07-19 08:13:20,362][INFO ][cluster.service          ] [furnace.engr.acx] removed {[anvil.engr.acx][t949B6eFQdWUyUzokOpGxw][inet[/192.168.7.239:9300]],}, reason: zen-disco-node_failed([anvil.engr.acx][t949B6eFQdWUyUzokOpGxw][inet[/192.168.7.239:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
....
[2012-07-19 08:29:16,569][INFO ][cluster.service          ] [furnace.engr.acx] removed {[app1.yarra][zIFK7ADWRv6fHmUa0aX3Jg][inet[/192.168.7.234:9300]],}, reason: zen-disco-node_failed([app1.yarra][zIFK7ADWRv6fHmUa0aX3Jg][inet[/192.168.7.234:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
Looking at the NodesFaultDetection code that is responsible for performing the
regular ping heartbeat, I'm just not sure why the 30 second timeout is not
kicking in properly here... it eventually does, but 3 x 30 seconds of retries
is very much shorter than the 20-odd minutes it took for the 2nd node to
finally be marked as 'gone' by this isolated node.
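For reference, these are the zen fault-detection knobs as I understand them.
The 30s timeout and 3 retries match the log above, but the 1s ping interval and
the config path are from memory, so treat this as a sketch:

# Zen fault-detection settings (shown here as an elasticsearch.yml excerpt).
cat >> config/elasticsearch.yml <<'EOF'
discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping reply
discovery.zen.fd.ping_retries: 3     # consecutive failures before the node is dropped
EOF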
Additionally, and probably the bigger problem here, is that after returning
network conditions to normal (purging the iptables rules again, see [2]) once
this isolated node had given up on the other 2, the cluster did not rejoin
correctly when network flow resumed. I waited 10 minutes, but there was nothing
in the logs on either half of the cluster showing discovery happening. Since
discovery (multicast in our case) only runs at startup/shutdown, this isn't
helping.
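A quick way to see that the two halves still haven't found each other is to
compare number_of_nodes from cluster health on each side; both would report 3
if the cluster had actually reformed:

# Check what each half of the cluster can see (node names as above, default HTTP port).
curl -s 'http://furnace.engr.acx:9200/_cluster/health?pretty=true' | grep number_of_nodes
curl -s 'http://anvil.engr.acx:9200/_cluster/health?pretty=true' | grep number_of_nodes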
Since this node was isolated, though, and the other 2 nodes were removed from
its local cluster state, how will it ever get to know about them again? The
NodesFaultDetection socket is broken at this point, and discovery only happens
at startup/shutdown... ?
I was forced to restart ES on the node that had been isolated, and then they
immediately rejoined. I was under the impression that the zen protocol should
keep sending ping heartbeats in a desperate attempt to locate any other node
out there, so when network conditions recover I don't quite understand why they
don't just magically rejoin properly.
In particular I was expecting
https://github.com/elasticsearch/elasticsearch/issues/2042 to come into play
here, with the 2 masters deciding who should win, but because discovery wasn't
happening, that never occurred.
I know that a 'normal' split brain is the result of something like a network
cable getting yanked, but another likely case is a misconfigured firewall or
similar (which is what the iptables block rules roughly simulate).
Is my test screwed or misguided here? Are my expectations incorrect?
cheers,
Paul Smith
[1] Sample elasticsearch.yml configuration
All 3 nodes have a very basic config, identical except for the node name,
obviously:
cluster.name: engr
node.name: furnace.engr.acx
path.data: /aconex/elasticsearch-data
discovery:
  zen:
    minimum_master_nodes: 2
[2] iptables rules applied on the cut-off node furnace.engr.acx; these are the
IP addresses of the other 2 nodes, dropping packets in both directions:
iptables -A INPUT -s 192.168.7.234 -j DROP
iptables -A INPUT -s 192.168.7.239 -j DROP
iptables -A OUTPUT -d 192.168.7.234 -j DROP
iptables -A OUTPUT -d 192.168.7.239 -j DROP
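To undo the block again (the 'purging' mentioned above), either the matching
deletes or a full flush does it; the flush assumes nothing else is configured
on the box:

iptables -D INPUT -s 192.168.7.234 -j DROP
iptables -D INPUT -s 192.168.7.239 -j DROP
iptables -D OUTPUT -d 192.168.7.234 -j DROP
iptables -D OUTPUT -d 192.168.7.239 -j DROP
# or simply: iptables -F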