Ping timeouts, two elected master nodes, wrong metadata and deleted indices

Hi,
I am getting ping timeouts from time to time. Recently the master
(192.168.5.4) did not reply fast enough (the timeout is 1m), another
master (192.168.5.2) was elected, and both stayed active.

A status request on the old master gave a NullPointerException; a
status request on the new master gave a green status with 39 nodes
(the old master was missing and did not rejoin the cluster).
A _cluster/state request on the old master still reported the old node
as master, so it never correctly lost its master status. The cluster as
a whole seemed to be running, though, as all indices were green and the
rivers were indexing correctly.
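
For reference, the checks boil down to requests along these lines (a
sketch, assuming the default http port; by "status request" I mean the
cluster health endpoint here):

  # cluster health as seen by the old and the new master
  curl 'http://192.168.5.4:9200/_cluster/health?pretty=true'
  curl 'http://192.168.5.2:9200/_cluster/health?pretty=true'

  # full cluster state, to compare which node each side reports as master
  curl 'http://192.168.5.4:9200/_cluster/state?pretty=true'
  curl 'http://192.168.5.2:9200/_cluster/state?pretty=true'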

After programmatically closing about 10 indices, the metadata got
messed up. It seems that the two master nodes were working against each
other. The old master updated its stale version of the state and
propagated the changes, resulting in an incorrect state with outdated
shard information pointing at nodes that no longer held those shards.
The old and the new master appear to have been in a race condition over
updating the index status, since afterwards some indices had been
closed while others remained open. The other nodes were monitoring two
masters and accepted cluster state from both, resulting in repeated
restarts of the fault detection.
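
For context, the closing itself is nothing special, just the close-index
API, roughly like this (index names and the target host are made-up
examples):

  # close an index; this was issued for about 10 indices
  curl -XPOST 'http://192.168.5.2:9200/search_index_101/_close'
  curl -XPOST 'http://192.168.5.2:9200/search_index_102/_close'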

Afterwards the nodes started deleting the indices that were not
assigned to them at the point where the old master lost its master
status.

My current configuration is the following (roughly sketched as an
elasticsearch.yml after the list):
es 0.17.10,
40 nodes,
5 master nodes,
minimum master nodes set to 3.
300 indices and 30 rivers.
shards per index: 1, replicas: 1.
unicast discovery and local gateway.
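
Roughly, in elasticsearch.yml terms (a sketch only; the unicast host
list is shortened and the cluster name is taken from the discovery log
line further down):

  cluster.name: trendictionsearch

  # on the 5 master-eligible nodes (node.master: false on the others)
  node.master: true

  gateway.type: local

  discovery.zen.minimum_master_nodes: 3
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["192.168.5.2:9300", "192.168.5.4:9300"]  # plus the other master nodes

  # fault detection values as they show up in the logs below
  discovery.zen.fd.ping_interval: 1s
  discovery.zen.fd.ping_timeout: 1m
  discovery.zen.fd.ping_retries: 3

  index.number_of_shards: 1
  index.number_of_replicas: 1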

Log extract from the new master node (192.168.5.2) when the timeout occurred:
[2012-01-27 18:08:47,921][DEBUG][zen.fd][main] [Mondo] [master] uses
ping_interval [1s], ping_timeout [1m], ping_retries [3]
[2012-01-27 18:08:47,922][DEBUG][zen.fd][main] [Mondo] [node ] uses
ping_interval [1s], ping_timeout [1m], ping_retries [3]
[2012-01-27 18:10:08,823][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-1]
[Mondo] [master] starting fault detection against master [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [initial_join]
[2012-01-28 05:00:33,862][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-1047]
[Mondo] [master] failed to ping [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
tried [3] times, each with maximum [1m] timeout
[2012-01-28 05:00:33,863][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-1047]
[Mondo] [master] stopping fault detection against master [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [master failure, failed to ping, tried [3] times, each with
maximum [1m] timeout]
[2012-01-28 05:00:33,863][INFO
][discovery.zen][elasticsearch[cached]-pool-11-thread-1044] [Mondo]
master_left [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [failed to ping, tried [3] times, each with maximum [1m]
timeout]
[2012-01-28 05:00:33,864][INFO
][cluster.service][elasticsearch[Mondo]clusterService#updateTask-pool-21-thread-1]
[Mondo] master {new
[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true},
previous [Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}},
removed {[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true},},
reason: zen-disco-master_failed ([Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true})

While closing the indices, when the state got mixed up (192.168.5.2):
[2012-01-30 10:38:08,378][WARN ][discovery.zen][New I/O server worker
#1-7] [Mondo] master should not receive new cluster state from [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}]
[2012-01-30 10:38:08,570][WARN ][discovery.zen][New I/O server worker
#1-7] [Mondo] master should not receive new cluster state from [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}]
[2012-01-30 10:38:08,801][WARN ][discovery.zen][New I/O server worker
#1-7] [Mondo] master should not receive new cluster state from [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}]

On the other nodes it looks as if there were two master nodes:
[2012-01-27 18:17:40,855][DEBUG][zen.fd][main] [Ev Teel Urizen]
[master] uses ping_interval [1s], ping_timeout [1m], ping_retries [3]
[2012-01-27 18:17:40,857][DEBUG][zen.fd][main] [Ev Teel Urizen] [node
] uses ping_interval [1s], ping_timeout [1m], ping_retries [3]
[2012-01-27 18:18:01,728][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-1]
[Ev Teel Urizen] [master] starting fault detection against master
[[Abner Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [initial_join]
[2012-01-27 18:18:07,766][INFO ][discovery][main] [Ev Teel Urizen]
trendictionsearch/a4rRT6_BSNeceIpHf9AVkw
[2012-01-28 05:06:52,403][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-832]
[Ev Teel Urizen] [master] failed to ping [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
tried [3] times, each with maximum [1m] timeout
[2012-01-28 05:06:52,404][DEBUG][zen.fd][elasticsearch[cached]-pool-11-thread-832]
[Ev Teel Urizen] [master] stopping fault detection against master
[[Abner Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [master failure, failed to ping, tried [3] times, each with
maximum [1m] timeout]
[2012-01-28 05:06:52,406][DEBUG][zen.fd][elasticsearch[Ev Teel
Urizen]clusterService#updateTask-pool-21-thread-1] [Ev Teel Urizen]
[master] restarting fault detection against master
[[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true}],
reason [possible elected master since master left (reason = failed to
ping, tried [3] times, each with maximum [1m] timeout)]
[2012-01-28 09:26:56,244][DEBUG][zen.fd][elasticsearch[Ev Teel
Urizen]clusterService#updateTask-pool-21-thread-1] [Ev Teel Urizen]
[master] restarting fault detection against master [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [new cluster stare received and we monitor the wrong master
[[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true}]]
[2012-01-28 09:26:59,252][DEBUG][zen.fd][elasticsearch[Ev Teel
Urizen]clusterService#updateTask-pool-21-thread-1] [Ev Teel Urizen]
[master] restarting fault detection against master
[[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true}],
reason [new cluster stare received and we monitor the wrong master
[[Abner Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}]]
[2012-01-28 09:26:59,258][DEBUG][zen.fd][elasticsearch[Ev Teel
Urizen]clusterService#updateTask-pool-21-thread-1] [Ev Teel Urizen]
[master] restarting fault detection against master [[Abner
Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}],
reason [new cluster stare received and we monitor the wrong master
[[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true}]]
[2012-01-28 09:26:59,260][DEBUG][zen.fd][elasticsearch[Ev Teel
Urizen]clusterService#updateTask-pool-21-thread-1] [Ev Teel Urizen]
[master] restarting fault detection against master
[[Mondo][BTiSBXfRQWqgTI1R19WGSA][inet[/192.168.5.2:9300]]{master=true}],
reason [new cluster stare received and we monitor the wrong master
[[Abner Little][-kmiCv3gTpqGqBmsgOBe0A][inet[/192.168.5.4:9300]]{master=true}]]

Afterwards, on the data nodes themselves, data was being deleted
(based on the wrong state):
[2012-01-30 10:50:56,201][DEBUG][indices.store][elasticsearch[Gorilla
Girl]clusterService#updateTask-pool-21-thread-1] [Gorilla Girl]
[search_index_356] deleting index that is no longer in the cluster
meta_date

I am planning to update to es 0.18 and am wondering whether the same
behaviour is still present there.
The questions here are:

  • Why does the old master not rejoin the cluster correctly? My best
    bet would be that it still thinks it is part of a cluster of which
    it is still the master, which would also explain why it keeps
    updating the state.
  • Why does the old master not lose its master status when the other
    one is elected? Is the fault detection behaving correctly when a
    node only temporarily goes down under load?
  • I thought that a node could only see one elected master node, and
    that setting minimum_master_nodes to 3 when there are 5 master
    nodes would make it impossible for two master nodes to be active at
    the same time (see the quorum sketch right after this list).
  • Why are the ping timeouts arriving so regularly? Could it be some
    threading issue, like the one with starting the rivers after a full
    cluster restart, since we are using many indices, many nodes and
    many rivers?
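
About the minimum_master_nodes expectation, this is the arithmetic I had
in mind (my reading of the setting, not something taken from the docs):

  # quorum for N master-eligible nodes: floor(N / 2) + 1
  # here: floor(5 / 2) + 1 = 3
  discovery.zen.minimum_master_nodes: 3
  # two disjoint groups of 3 master-eligible nodes cannot exist among 5,
  # so I expected at most one elected master at any time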

Also, is it possible to keep the nodes from deleting the indices
locally, so that I might have a chance to recover the deleted indices
by copying them over to the node where they are expected to be found?
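
What I have in mind is copying the index directory of the local gateway
between nodes by hand, something like this (the paths, node ordinal and
target host are examples, and I do not know whether this is safe or
supported):

  # data layout: <path.data>/<cluster.name>/nodes/<ordinal>/indices/<index>
  rsync -a /var/lib/elasticsearch/trendictionsearch/nodes/0/indices/search_index_356/ \
        node17:/var/lib/elasticsearch/trendictionsearch/nodes/0/indices/search_index_356/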

Thanks,
Michel

We have seen a similar situation resulting in two masters using
0.18.6. We have 5 servers with master/data nodes and 10 servers with
client nodes.

Master was #5. I assume there was some type of network glitch, but we
aren't sure exactly what happened. Then #2 took over as master and
#1, #3, #4 and most of the client nodes (but not all) joined it.

It seemed that #5 was still able to ping #1 and #4 so it thought it
should stay master. Using the head plugin we could tell that #5 was
trying to tell #1 and #4 to reallocate a shard. Though, since they
were actually part of #2's cluster, they seemed to ignore the command;
it stayed in that state until we bounced #5.

I am guessing that the "ping" is not smart enough. It knows whether
another node is alive, but maybe it doesn't return any cluster info
that would reveal that another node has claimed to be master?

Hi,

Yea, this can happen. One way to help mitigate it is by setting the minimum_master_nodes in the configuration to a higher value (3 for a 5 node cluster, or 7 in a 10 node cluster), but it might still happen in complex disconnection scenarios (one way disconnect). One of the things left to add to zen discovery is automatic resolution of such scenarios, instead of having to go and bounce a specific node.

-shay.banon


Hi,
the question is whether there is a way to prevent the "wrong" master
from sending updates to the other nodes, which results in an
inconsistent state and finally leads to data loss when the nodes delete
data that they think is no longer needed.

When we had the situation, the nodes themselves kept switching masters
between the correct new one and the old one ("new cluster stare
received and we monitor the wrong master"). Afterwards, while closing
some indices, the old master received some of the updates, updated its
wrong state and propagated the changes.

So, yes, the question would be whether there is a possibility to
detect and prevent a situation with two master nodes working against
each other.

Best,
Michel


Hi Shay,

do you recommend that we set minimum_master_nodes to (number of
cluster nodes / 2) in order to avoid this bug?

Thanks,
Thibaut


In my scenario, nodes #1 and #4 should know that there are two masters
for the same cluster name. They should force a re-election of a new
master. This would minimize any loss of data, if it happens quickly
enough.

Sounds like Shay is aware of the issue. We'll look forward to this
enhancement in a future release.

Thanks,
Drew

BTW, our minimum_master_nodes is set to 3.

A couple more notes.

We have redundant importers, but still lost data when this scenario
occurred. Luckily we planned for situations like this that cause gaps
in the data. We used our recovery process to reindex the missing
documents.

Also, it took us a couple of hours to notice that node #5 had gone
rogue. All other servers reported cluster health as green. We monitor
the health of each node, but only monitor cluster health on one node.
To detect this situation we should add monitors for cluster health on
each node.
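
A minimal sketch of such a per-node check (hosts are placeholders; it
just compares the master_node that each node reports in its own view of
the cluster state):

  # ask every node directly which node it currently considers master;
  # differing IDs mean the cluster has (at least) two masters
  for host in node1 node2 node3 node4 node5; do
    master=$(curl -s "http://$host:9200/_cluster/state" \
             | grep -o '"master_node":"[^"]*"' | head -n 1)
    echo "$host -> $master"
  done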