Summary: I am having problems with nodes losing the master from their
node list. This is causing health to be reported differently on
different nodes, and it appears that shard replicas are getting into
divergent states when clients write to different nodes.
Long version:
We have a backup process on one node that runs nightly. Apparently
that causes enough GC pressure that GC collections take much longer
than normal:
/var/log/elasticsearch/production.log.2010-08-21:[11:16:23,000][WARN ]
[monitor.jvm ] [Nocturne] Long GC collection occurred,
took [42.5s], breached threshold [10s]
/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,382][WARN ]
[monitor.jvm ] [Nocturne] Long GC collection occurred,
took [55.1s], breached threshold [10s]
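(In case it's useful, the pauses can be confirmed outside the
elasticsearch logs by sampling the JVM during the backup window; a
minimal sketch, assuming a single elasticsearch JVM on the box:

  # sample GC counters every 5 seconds; the FGCT/GCT columns jumping by
  # tens of seconds between samples lines up with the pauses logged above
  jstat -gcutil $(pgrep -f elasticsearch | head -n1) 5000
)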
Which is fine on its own, since our load is very low at that time. The
problem is that this machine happens to be the master node. On our two
other nodes we see these log lines:
/var/log/elasticsearch/production.log.2010-08-21:[11:16:20,263][WARN ]
[transport ] [Devil-Slayer] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]
/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,371][WARN ]
[transport ] [Devil-Slayer] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]
/var/log/elasticsearch/production.log.2010-08-21:[11:16:21,757][WARN ]
[transport ] [Noh-Varr] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]
/var/log/elasticsearch/production.log.2010-08-21:[11:37:22,376][WARN ]
[transport ] [Noh-Varr] Transport response handler
timed out, action [discovery/zen/fd/masterPing], node [[Nocturne]
[b2d9e7a8-6f6f-497d-903c-b564116e85d5][inet[/10.102.43.160:9300]]]
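(For what it's worth, the zen fault-detection timeouts look tunable. A
sketch of starting nodes with a longer master-ping timeout and more
retries, so a single slow ping doesn't drop the master; the values here
are guesses for our workload, not tested recommendations:

  # more forgiving master fault detection: allow 60s per ping and 6
  # failed pings before the non-masters declare the master gone
  bin/elasticsearch -Des.discovery.zen.fd.ping_timeout=60s \
                    -Des.discovery.zen.fd.ping_retries=6
)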
So the master ping times out while the master is under load. But the
non-master nodes are then left in a weird state and don't seem to
recover from it: they no longer list the master in their node list, and
cluster health on them reports only 2 active nodes and only knows about
the shards on those nodes. The master node still knows about all 3
nodes, and reports health as if all shards from all 3 nodes were
available.
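This is how the difference shows up, asking each node for its own view
of the cluster (hostnames below are placeholders for our three nodes):

  # each node answers from its own cluster state; during the split the
  # two non-masters report 2 nodes while the master still reports 3
  for host in node1 node2 node3; do
    echo "== $host =="
    curl -s "http://$host:9200/_cluster/health?pretty=true"
  done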
Is elasticsearch built to handle ping timeouts like this gracefully?
I'm bringing it up here because I suspect it is, in which case this
behavior might be a bug.
Also, after running in this state for a day or two, some of the shard
copies had document counts that differed by a few dozen, presumably
because nodes were not replicating changes to the shard copies they
weren't aware of. We were able to fix this by doing a rolling restart
of the cluster, with no downtime required. That was pretty cool.
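For reference, the count divergence shows up the same way if you ask
each side of the split directly (the index name is a placeholder):

  # the same count query answered by different sides of the split
  # returns different totals, since each side only sees its own copies
  for host in node1 node2 node3; do
    echo "== $host =="
    curl -s "http://$host:9200/myindex/_count"
    echo
  done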