We've been having a recurring problem in which our server and client
nodes get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have
one server node, and one client connecting as a node client with data
set to false.
Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]
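For what it's worth, here is how I read the two transport warnings: subtracting the "timed out" figure from the "sent" figure gives about 30 seconds in both cases, so the ping responses did arrive, just roughly 31s and 1.4s too late. A small sketch of that arithmetic follows; the class and method names are only illustrative and are not part of elasticsearch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical, not an elasticsearch class) for reading the warnings above. */
final class TimedOutPing {
    // Matches the "sent [Nms] ago, timed out [Mms] ago" part of the warning,
    // tolerating a line wrap before "ago".
    private static final Pattern TIMES =
            Pattern.compile("sent \\[(\\d+)ms\\]\\s+ago, timed out \\[(\\d+)ms\\]\\s+ago");

    /** Returns {sentAgoMs, timedOutAgoMs}, or null if the text is not such a warning. */
    static long[] parse(String logLine) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    /** Timeout that applied to the request: time since send minus time since the timeout fired. */
    static long timeoutMs(long sentAgoMs, long timedOutAgoMs) {
        return sentAgoMs - timedOutAgoMs;
    }

    public static void main(String[] args) {
        String[] warnings = {
            "sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping]",
            "sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping]"
        };
        for (String w : warnings) {
            long[] t = parse(w);
            System.out.println("timeout=" + timeoutMs(t[0], t[1]) + "ms, late by " + t[1] + "ms");
        }
    }
}
```

Both warnings work out to a timeout of about 30000ms, which matches the [30s] ping timeout in the node client's log.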
And the Tomcat log from the same time, on the machine where the node
client is running, shows this:
2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])
However, even though it looks like the node client rejoined, they
aren't talking to each other. I have to restart the elasticsearch
server to get them talking again. One more thing: we have a
backup process that runs on these servers and puts some load on the
system, which is what causes the master to drop out. But it
seems like they should be able to successfully reconnect afterwards.
Any ideas?
Thanks,
Lar