We've been having a recurring problem in which our server and client nodes
get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have one
server node, and one client connecting as a node client with data set to
false.
Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]
And the Tomcat log from around the same time, on the machine where the node
client is running, shows this:
2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])
So even though it looks like the node client rejoined, the two nodes aren't
talking to each other. I have to restart the elasticsearch server to get them
talking again. One more thing: a backup process runs on these servers and
puts some load on the system, which is what causes the master to drop
out. But it seems like the nodes should be able to reconnect successfully
afterwards.
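As a sanity check on the two transport warnings in the server log, the gap
between "sent" and "timed out" can be worked out to see which timeout was
in effect. Below is a minimal Python sketch of that arithmetic; the helper
name implied_timeout_ms and the regular expression are only for
illustration, and the input strings are copied from the warnings above.

```python
import re

# Fragments copied from the two [transport] warnings in the server log.
LOG_LINES = [
    "sent [61382ms] ago, timed out [31381ms] ago",
    "sent [31383ms] ago, timed out [1383ms] ago",
]

# Captures the two millisecond values reported by each warning.
PATTERN = re.compile(r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago")


def implied_timeout_ms(line):
    """Return (time since sent) - (time since timeout) in milliseconds."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a timeout warning: %r" % line)
    sent_ago, timed_out_ago = (int(value) for value in match.groups())
    return sent_ago - timed_out_ago


timeouts = [implied_timeout_ms(line) for line in LOG_LINES]
print(timeouts)  # [30001, 30000]
```

Both pings timed out after roughly 30 seconds, which matches the "maximum
[30s] timeout" reported on the client side, so the two logs appear to be
describing the same fault-detection pings.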
Any ideas?
Thanks,
Lar