Split-Brain Detection

Background

We embed elasticsearch in our web application and let Amazon EC2 control
the startup and teardown of our server instances based on load metrics. The
exact same web app runs on every instance, so each new instance brings a
new elasticsearch node into existence.

I am trying to bulletproof some scenarios, and just want to ask what the
best practice might be for some of these situations.

Split-Brain

Because there is no way to know how many servers will be running at any
particular time, I am unsure what I might be able to do to recognize
split-brain scenarios and take down servers that are not part of the
"primary" cluster. My thought at the moment is a service that performs
these steps (a rough sketch follows the list):

  1. Identify all nodes in the EC2 group using the EC2 API.
  2. Perform a cluster health check against all the nodes returned from
    this list.
  3. In a healthy cluster, the number of nodes identified in the ES list
    will match the EC2 list.
  4. In a split-brain scenario:
    1. Identify each group of nodes using the ES stats.
    2. Keep the largest group of nodes running.
    3. Remove the other nodes from the load balancer using the EC2 API.
    4. Terminate the other nodes using the EC2 API.
  5. Amazon EC2 brings up new servers to replace the ones I terminated,
    depending on load requirements.
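
A rough sketch of such a service in Python might look like the following;
boto3, requests, the "cluster" tag used to find the instances, and port
9200 are illustrative assumptions, not details from the setup described
above.

    import boto3
    import requests

    ec2 = boto3.client("ec2")

    def ec2_node_ips(tag_value="es-cluster"):
        # Step 1: every running instance in the EC2 group.
        resp = ec2.describe_instances(Filters=[
            {"Name": "tag:cluster", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ])
        return [i["PrivateIpAddress"]
                for r in resp["Reservations"] for i in r["Instances"]]

    def master_seen_by(ip):
        # Step 2: ask each node which master it currently follows.
        state = requests.get("http://%s:9200/_cluster/state" % ip,
                             timeout=5).json()
        return state["master_node"]

    def nodes_to_euthanize(ips):
        # Steps 3-4: in a healthy cluster every node reports the same
        # master; in a split-brain there are several groups, and we keep
        # the largest one.
        groups = {}
        for ip in ips:
            groups.setdefault(master_seen_by(ip), []).append(ip)
        if len(groups) <= 1:
            return []
        keep = max(groups.values(), key=len)
        return [ip for members in groups.values()
                if members is not keep for ip in members]

    # Steps 4c/4d would then deregister the returned IPs from the load
    # balancer and call ec2.terminate_instances() on the matching instances.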

If this works, it does so only because there is another API (EC2) I can
consult for a definitive list of all the nodes in my cluster. I worry that
this process will be too slow, because data loss and corruption can happen
the moment one of the split-brain nodes is used.

Are there any other approaches I might take?

Could your 'euthanising' script (couldn't think of a better word, sorry)
work by progressively killing off nodes, waiting for the transition from
yellow back to green as the shards re-replicate, then killing another one,
waiting long enough each time to ensure the yellow -> green transition?
After a certain number of node deaths you'll be stuck at yellow, because
there are no longer enough nodes to satisfy all the replica configurations,
so after enough time you've worked out you've gone just one node too far
and need to restart one node.
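
Something like this, perhaps; a minimal sketch assuming the requests
library and the cluster health API's wait_for_status parameter, with the
actual mechanism for stopping a node left abstract:

    import requests

    ES = "http://localhost:9200"  # any node that stays in the cluster

    def wait_for_green(timeout="60s"):
        # Blocks until the cluster reports green or the timeout expires,
        # then returns whatever status was actually reached.
        health = requests.get(
            "%s/_cluster/health" % ES,
            params={"wait_for_status": "green", "timeout": timeout}).json()
        return health["status"]

    def kill_progressively(candidate_nodes, shut_down):
        # shut_down(node) is whatever actually stops an instance
        # (EC2 API call, init script, ...) -- deliberately left abstract.
        for node in candidate_nodes:
            shut_down(node)
            status = wait_for_green()
            if status != "green":
                # Gone one node too far: the survivors can no longer hold
                # every configured replica, so one node needs to come back.
                print("stuck at %s after removing %s" % (status, node))
                break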

I'm sure there's some mathematical way, though, to compute the minimum
number of nodes that will at least satisfy a yellow status for your cluster
(all primary shards allocated, but not all replicas available). Perhaps this
would become your minimum_master_nodes setting? This is a bit of a gut feel,
I haven't done the maths on it, and for large clusters with lots of shards
it is probably restrictive because it would require too many master-eligible
nodes to satisfy that setting.

I'm very interested to see how you go with whatever solution you end up
going with. Let us know!

Paul Smith

Hi Paul,

We currently set the minimum number of instances in the cluster to the
number of replicas + 1, which allows us to initialize the cluster in a green
state (with one replica per shard, for example, two nodes are enough for
every shard copy to be allocated). Of course, from that point forward Amazon
is creating and destroying nodes, so a value for minimum_master_nodes cannot
be predicted ahead of time.

If we get a split-brain, however, we might end up with a cluster in yellow
state as a side effect of dropping below this threshold. There isn't a way
around this, but no data is lost. I would imagine that each side of the
split will immediately start promoting some replicas to primary shards to
compensate for the primaries it has lost.

At this time, I think it would be safe to drop all of the nodes involved in
the split-brains that I will be euthanizing. There wouldn't be any reason
that I can foresee to wait for a transition to green, would there?

The biggest problem is making sure that the nodes in the split-brain I will
be euthanizing do not receive any updates before I can take them down or
remove them from the load balancer. I wonder if minimum_master_nodes could
help me in some way if it could be controlled dynamically?

For example, as AWS brings a new node online or takes one down, I set
minimum_master_nodes to floor(total_nodes / 2) + 1? Maybe this would help me
make sure that an ES instance doesn't perform an update the moment a
split-brain occurs. I am assuming that ES can react that fast to a drop
below minimum_master_nodes, but my hunch is that it is tied to the zen
heartbeat.
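
A sketch of what this could look like, assuming the transient
cluster-settings API and a dynamic discovery.zen.minimum_master_nodes
setting, which only arrived in later elasticsearch releases (no such API
existed at the time of this thread), and using the floor(total_nodes / 2) + 1
quorum above:

    import requests

    ES = "http://localhost:9200"

    def quorum(total_nodes):
        # floor(total_nodes / 2) + 1
        return total_nodes // 2 + 1

    def update_min_master_nodes(total_nodes):
        # Push the new quorum as a transient cluster setting whenever AWS
        # adds or removes an instance.
        return requests.put(
            "%s/_cluster/settings" % ES,
            json={"transient": {
                "discovery.zen.minimum_master_nodes": quorum(total_nodes)}}
        ).json()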

-- jim

Hi,

Yes, minimum_master_nodes is the best way to guard against that. The
heartbeat to other nodes detects a node going down in an orderly manner very
fast, and a node exiting abnormally almost as fast (broken socket). It can
take longer if the socket is left dangling.

Regarding changing minimum_master_nodes dynamically, it could certainly be
exposed so that people can set it through an API (a cluster-level settings
API is long overdue).

In theory, we could have something like automatic_minimum_master_nodes that
can be set to something like 3-quorum, and elasticsearch would automatically
set it to either a minimum of 3 or a quorum of all master-eligible nodes in
the cluster. But this effectively means you end up with a value of 3,
because you can't tell whether a node is leaving because you brought it down
intentionally or because it is dead. So it's not really possible (just
throwing it out there, in case you thought about it).
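
For illustration, the "3-quorum" idea reads as taking the higher of a fixed
floor and a quorum of the master-eligible nodes; a minimal sketch with
hypothetical names:

    def automatic_minimum_master_nodes(master_eligible_nodes, floor_value=3):
        # "3-quorum": never less than 3, otherwise a quorum of the
        # master-eligible nodes currently in the cluster.
        return max(floor_value, master_eligible_nodes // 2 + 1)

    # e.g. 5 eligible masters -> max(3, 3) = 3; 9 -> max(3, 5) = 5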


One more thing to note: you can potentially simplify this by having
dedicated (small) instances that are just master nodes (node.master set to
true, node.data set to false), for example three of them, and letting the
data nodes (node.master set to false, node.data set to true) come and go as
they want.
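
The two node profiles described above would look something like this in
elasticsearch.yml (a sketch; only the node.master / node.data settings come
from the message, the layout is mine):

    # Three small, dedicated master nodes:
    node.master: true
    node.data: false

    # Autoscaled data nodes that come and go:
    node.master: false
    node.data: true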

Shay Banon

That's something I hadn't considered. Glad there are some alternative
approaches as well.