Client node rejoins but isn't communicating with the server


(Lar Mader) #1

We've been having a recurring problem in which our server and client
nodes get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have
one server node, and one client connecting as a node client with data
set to false.

Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]

And our tomcat logs from around the same time, where the node client is
running, show this:

2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])

However, even though it looks like the node client rejoined, the two
nodes aren't talking to each other. I have to restart the elasticsearch
server to get them talking again. One more thing: we have a backup
process that runs on these servers and puts some load on the system,
which is what causes the master to drop out. But it seems like they
should be able to reconnect successfully afterwards.
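
For reference, the master_left reason in the client log reports [3] ping
attempts with a [30s] timeout each. A minimal sketch of the arithmetic,
assuming the retries run back to back so the failure window is simply
retries times timeout (the function name is hypothetical, and the reason
string is copied from the log):

```python
import re

# Reason string copied from the master_left line in the client log.
REASON = "failed to ping, tried [3] times, each with maximum [30s] timeout"


def detection_window_seconds(reason: str) -> int:
    """Upper bound on how long the failed pings took: retries * per-ping timeout.

    Assumption: retries are sequential and each one uses its full timeout.
    """
    match = re.search(
        r"tried \[(\d+)\] times, each with maximum \[(\d+)s\] timeout", reason
    )
    if match is None:
        raise ValueError("unrecognised master_left reason: " + reason)
    retries = int(match.group(1))
    timeout_s = int(match.group(2))
    return retries * timeout_s


print(detection_window_seconds(REASON))  # prints 90
```

So under that assumption the backup load would only need to stall ping
responses for roughly a minute and a half before the client drops the
master.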

Any ideas?
Thanks,
Lar


(Lar Mader) #2

Oops, sorry for the double post!


(Lar Mader) #3

Ok, more info.

I can repro the problem by simply stopping the network interface on the
elasticsearch server, waiting until the client notices that the master has
left, and then starting the network back up. At this point the client
rejoins the cluster, but searches from the client fail with an index
missing exception. Restarting either the client or the server fixes it.

Is this a known problem, or an issue with the node client? Would the
transport client handle this better?

Thanks,
Lar

