Zen ping timeout causes nodes to lose master permanently

Summary: I am having problems with nodes losing the master from their
node list. This is causing health to be reported differently on
different nodes, and it appears that shard replicas are getting into
divergent states when clients write to different nodes.

Long version:

We have a backup process on one node that runs nightly. Apparently
that causes GC pressure, making GC collections take longer than normal:

/var/log/elasticsearch/production.log.2010-08-21:[11:16:23,000][WARN ]
[monitor.jvm ] [Nocturne] Long GC collection occurred,
took [42.5s], breached threshold [10s]
/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,382][WARN ]
[monitor.jvm ] [Nocturne] Long GC collection occurred,
took [55.1s], breached threshold [10s]

That by itself is fine, since our load is very low at that time. The
problem is that this machine happens to be the master node. On our two
other nodes we get these log lines:

/var/log/elasticsearch/production.log.2010-08-21:[11:16:20,263][WARN ]
[transport ] [Devil-Slayer] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]
/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,371][WARN ]
[transport ] [Devil-Slayer] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]

/var/log/elasticsearch/production.log.2010-08-21:[11:16:21,757][WARN ]
[transport ] [Noh-Varr] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]

/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,376][WARN ]
[transport ] [Noh-Varr] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]

So the ping times out while the master is under high load. But now the
non-master nodes are in a weird state and don't seem to recover from it.
They don't list the master in their node list, and cluster health on them
reports only 2 active nodes and only knows about the shards on those
nodes. The master node still knows about all 3 nodes and reports health
as if all shards from all 3 nodes were available.
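
A quick way to see the disagreement is to ask each node for its own view
of cluster health and compare the answers. A minimal sketch in Python
(the hostnames are placeholders, and it assumes the /_cluster/health
endpoint is reachable on port 9200 in your version):

import json
import urllib.request

# placeholder hostnames for the three nodes
nodes = ["es-node1", "es-node2", "es-node3"]

for host in nodes:
    url = "http://%s:9200/_cluster/health" % host
    with urllib.request.urlopen(url) as resp:
        health = json.load(resp)
    # each node answers from its own cluster state, so a partitioned
    # node reports fewer nodes than the master does
    print(host, health["status"], "nodes seen:", health["number_of_nodes"])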

Is elasticsearch supposed to handle ping timeouts like this gracefully?
I'm bringing it up here because I suspect it is, in which case this might
be a bug.

Also, after running in this state for a day or two, some of the shards
had document counts that differed by a few dozen, presumably because
changes were not being replicated to the shards the nodes weren't aware
of. We were able to fix this by doing a rolling restart of the cluster,
with no downtime required. That was pretty cool.
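
For spotting that kind of divergence, a rough sketch is to ask each node
directly for a total document count and compare the results (this assumes
the count API (_count) is available in your version and defaults to
matching all documents; the hostnames are placeholders):

import json
import urllib.request

nodes = ["es-node1", "es-node2", "es-node3"]

for host in nodes:
    with urllib.request.urlopen("http://%s:9200/_count" % host) as resp:
        # counts answered from different sides of the partition drift
        # apart if clients keep writing to both sides
        print(host, json.load(resp)["count"])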

Hi,

Yes, what happened here is that the cluster got into a split brain.
Currently, the way elasticsearch handles a split brain is by not trying to
join the two clusters back into a single cluster. The main problem with
split brain is when you still have clients working against the two
different clusters.

I plan to handle split brain scenarios better. One option is to have the
two clusters try and rejoin once such a scenario is detected. Another
option is to have the smaller cluster kill itself when something like this
happens.

Resolving this is currently a manual process, which is what you did:
basically a restart of the smaller cluster.

-shay.banon

Right, I guess network partitions are pretty hard to detect and repair
in clustered systems. We'll just do the rolling restart if this happens
again; it was pretty painless.

There are different ways to handle network partitioning, and none of them
is really good. The most popular solution, the Dynamo model (though in it,
eventual consistency should really be called eventual consistency with a
chance of losing data), is not really applicable to how elasticsearch
works (or to search engines in general). It could be implemented, but it
would take so many resources out of the system that it would render it
useless.

There are other ways. For example, requiring a quorum of the known
cluster in order to accept writes, and then automatically rejoining the
clusters once the network partitioning has been resolved.
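
As a rough sketch of the quorum idea (illustrative only; this is not an
existing elasticsearch setting or API, just the arithmetic):

def can_accept_writes(visible_nodes, known_cluster_size):
    # a side of the partition only accepts writes if it still sees a
    # majority of the cluster size it knew about before the partition
    quorum = known_cluster_size // 2 + 1
    return len(visible_nodes) >= quorum

# with the 3 node cluster above, the 2 node side keeps accepting writes,
# while a single isolated node would refuse them until it rejoins
print(can_accept_writes({"Devil-Slayer", "Noh-Varr"}, 3))  # True
print(can_accept_writes({"Nocturne"}, 3))                  # False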

If I remember correctly, you are running on ec2, so I suggest, in any
case, increasing the failure detector timeouts (the failure detector is
currently a simple polling implementation; other implementations will
come later):

discovery.zen.fd.ping_timeout: 1m
discovery.zen.fd.ping_retries: 5

This effectively configures a 5 minute window before the master is
considered to have failed.
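
A quick sanity check on that arithmetic (a sketch, assuming the retries
run back to back):

ping_timeout_seconds = 60   # discovery.zen.fd.ping_timeout: 1m
ping_retries = 5            # discovery.zen.fd.ping_retries: 5

# the master is only dropped after every retry has timed out, so it must
# be unresponsive for roughly this long:
print(ping_timeout_seconds * ping_retries / 60, "minutes")  # 5.0 minutes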

-shay.banon
