Client node rejoins but isn't communicating with the server

Lar_Mader · February 13, 2012, 7:37pm

We've been having a recurring problem in which our server and client
nodes get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have
one server node, and one client connecting as a node client with data
set to false.

Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]

And our tomcat logs from the same time, where the node client is
running, show this:

2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])

However, even though it looks like the node client rejoined, they
aren't talking to each other. I have to restart the elasticsearch
server to get them talking again. One more thing: we have a
backup process that runs on these servers and puts some load on the
system, which is what causes the master to drop out. But it
seems like they should be able to reconnect successfully afterwards.
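
For reference, the [3] tries and [30s] timeout in the client log look like
the default zen fault detection settings. Below is a rough sketch of how they
might be loosened in elasticsearch.yml so that a node under backup load is not
dropped as quickly. The values are illustrative assumptions, not tested
recommendations, and this would not by itself explain why the rejoin fails.

```yaml
# Zen fault detection (pings between master and the other nodes).
# Values below are illustrative only; the log suggests 30s / 3 are in effect now.
discovery.zen.fd.ping_interval: 1s   # how often a node is pinged
discovery.zen.fd.ping_timeout: 60s   # how long to wait for each ping response
discovery.zen.fd.ping_retries: 6     # failed pings before the node is considered gone
```

With these example values a node would be declared gone only after roughly
6 x 60s of failed pings instead of 3 x 30s.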

Any ideas?
Thanks,
Lar

Lar_Mader · February 13, 2012, 9:20pm

Oops, sorry for the double post!

Lar_Mader · February 14, 2012, 12:34am

Ok, more info.

I can repro the problem by simply stopping the network interface on the
elastic server, waiting until the client notices that the master has left,
and then starting the network back up. At this point the client joins the
cluster, but searches from the client fail. I get an index missing
exception from the client. Restarting either the client or the server fixes
it.

Is this a known problem, or an issue with the node client? Would the
transport client handle this better?

Thanks,
Lar
