Client node rejoins but isn't communicating with the server


(Lar Mader) #1

We've been having a recurring problem in which our server and client nodes
get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have one
server node, and one client connecting as a node client with data set to
false.

Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]

And then our tomcat logs from the same time, where the node client is
running, show this:

2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])

So even though it looks like the node client rejoined, they aren't talking
to each other. I have to restart the elasticsearch server to get them
talking again. One more thing: we have a backup process that runs on these
servers and puts some load on the system, which is what causes the master
to drop out. But it seems like they should be able to reconnect
successfully afterwards.
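
In case it helps, here is a sketch of the relevant configuration on our side (the fault-detection values shown are the defaults that match the log's "tried [3] times, each with maximum [30s] timeout"; raising them is just an idea for riding out the backup-induced load, not something we've verified):

```yaml
# elasticsearch.yml on the client side (matches client=true, data=false in the logs):
node.client: true
node.data: false

# Zen fault-detection settings on the server (defaults shown; raising
# ping_timeout / ping_retries might ride out load spikes from the backup):
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
```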

Any ideas?
Thanks,
Lar


(Shay Banon) #2

They should restore the connection, yes. Can you easily repro this? If so, can you try it with 0.19.RC2? There have been some enhancements there for the mentioned scenario.

On Monday, February 13, 2012 at 9:08 PM, Lar Mader wrote the post quoted above.


(Lar Mader) #3

I have verified that the TransportClient doesn't suffer from this problem,
which is probably as expected. Only the NodeClient has this problem, so it
seems to be related to rejoining a cluster after a network disruption long
enough to exceed the ping timeouts/retries.

I am currently using 0.18.7; I haven't yet tried it on 0.19.RC2.

Lar



(Frederic) #4

Hi Kimchy,

I've been doing some local tests on this using ES 0.19.0 and found that,
with a couple of threads each continuously indexing data through its own
TransportClient, if I shut down ES I see the expected
NoNodeAvailableExceptions, but when I start it again the exceptions stop
(as if the server was detected) and yet the clients hang and don't index
any more data.

If I restart the client it works again. Is it possible that, once you
create a client, it stays bound to the current node name and cannot
operate with a new one?

I'll try to recreate the scenario in a simpler test, but now that I read
this post, I realize I've seen this behavior at different times across
different ES versions in our cluster, and we do have a similar backup
process that could affect master response (I always thought it was some
kind of network problem).
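
For what it's worth, each indexing thread wraps its requests in a retry loop roughly like the sketch below (self-contained so it compiles without the ES jars; in the real worker the caught exception is NoNodeAvailableException, stood in for here by a plain RuntimeException):

```java
import java.util.concurrent.Callable;

// Sketch of the retry wrapper around each index request.
public class RetryingIndexer {

    // Retries op up to maxAttempts times with exponential backoff,
    // rethrowing the last failure if every attempt fails.
    static <T> T retry(Callable<T> op, int maxAttempts, long baseBackoffMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) { // NoNodeAvailableException in the real worker
                last = e;
                Thread.sleep(baseBackoffMs << attempt); // exponential backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a node that comes back on the third attempt.
        String result = retry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("no node available");
            return "indexed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // indexed after 3 attempts
    }
}
```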

Thanks,

Frederic



(Shay Banon) #5

Maybe the index requests are just waiting for the relevant shards to become available? By default they will wait 1 minute for that (the timeout). In any case, they should not become stuck; if you can recreate it, that would be great.



(Frederic) #6

Hi Kimchy,

I haven't yet been able to reproduce this issue in a simpler test, but
today we faced the following similar situation:

The disk on one of the nodes failed, so ES wasn't able to write to it. Our
index worker is a TransportClient with that server configured, among
others, as a transport host.
Right after the failure, the worker halted. I shut down that node; the
worker was still inactive. I restarted the worker several times, and it
always halted when trying to create the transport client.
I actually ended up shutting down and restarting the whole cluster, always
with the problematic server down, and the worker still halted when
creating the transport nodes.

So, finally, I deployed a new version of the client with the inactive host
removed from the configuration... and voila, it started and indexed docs
properly.

Am I missing something regarding TransportClient behavior?

Cheers,



(Frederic) #7

Some additional info based on a unit test for a custom ES clients factory
(https://gist.github.com/2050841):

The test creates either a node or a transport client based on some config.
Creating a Transport one, it halted when adding 'localhost' as a host
while no ES was running.

So, as far as I can see, when a TransportClient is created and hosts are
then added using

client.addTransportAddress(new InetSocketTransportAddress(host, port))

the client tries to discover an ES instance running on those hosts and
waits for a given time. It doesn't seem to be the default
discovery.zen.ping.timeout (I'm not defining that value when creating the
client), as that is 3 sec and my client waited for a really long time.

Thus, I infer that, if one of the servers is down, when adding it as a
host to the TransportClient, the client waits for that server to be up, no
matter whether the others are actually running. Is this possible? If so,
what would be the approach to avoid this situation or to set the discovery
timeout to a lower value?

Thanks,



(Shay Banon) #8

I ran this: https://gist.github.com/2057402 without a server to connect to
(just creating a transport client and adding a localhost endpoint with no
server behind it), and it's fine: it simply bails because it fails to open
a socket against it (you can see it in the logging if you turn it up).

In 0.19, we added another layer of timeout (defaults to 5 seconds) for the
sampling intervals, which gives another layer of "protection" from
misbehaving servers. But if the server is not even running, you basically
can't open a socket against it...
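
For anyone tuning this, a sketch of the client-side settings involved (assuming the 0.19 setting names; these go in the Settings passed when building the TransportClient):

```yaml
# The per-node timeout used while sampling the configured nodes
# (the 5-second default mentioned above):
client.transport.ping_timeout: 5s

# How often the client re-samples the list of configured nodes:
client.transport.nodes_sampler_interval: 5s
```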

On Fri, Mar 16, 2012 at 6:23 PM, Frederic focampo.br@gmail.com wrote:

Some additional info based on a unit test for a custom ES clients factory (
https://gist.github.com/2050841):

The test creates either node or tranport client based on some config.
Creating a Transport one, it halted when adding 'localhost' as host and no
ES was running.

So, as far as I can see, when a TransportClient is created, and then some
hosts are added using

client.addTransportAddress(new InetSocketTransportAddress(host, port))

the client tries to discover an ES instance running on that hosts and
waits for a given time. It doesn't seem to be the default
discovery.zen.ping.timeout (I'm not defining that value when creating the
client) as it is 3sec and my client waited for really long time

Thus, I infer that, if one of servers is down, when adding it as a host to
the TC, the client waits for that server to be up, no mather if the others
are actually running, is this possible? If so, what would be the approach
to avoid this situation or set the discovery timeout to a lower value?

Thanks,

On Wednesday, 14 March 2012 16:55:33 UTC-3, Frederic wrote:

Hi Kimchy,

I haven't been able yet to reproduce in a simpler test this issue, but
today we faced the following similar situation:

The disk of one of the nodes failed so ES wasn't able to write on it. Our
index worker is a TranportClient with that, among others, server configured
as a transport host.
Right after the failure, the worker halted. I shutted down that node, the
worker was still inactive. Restarted several times the worker and it always
halted when trying to create the transport client.
I actually ended up shutting down and restarting the whole cluster,
always with the conflictive server down, and the worker still halted when
creating the Transport nodes.

So, finally I deployed a new version of the client removing the inactive
host from the configuration... and voila, it started and indexed docs
properly

Am I missing something regarding Tranport client behavior?

Cheer,

On Thursday, 8 March 2012 19:28:58 UTC-3, kimchy wrote:

Maybe the index requests just wait for the relevant shards to become
available? By default they will wait 1 minute for that (the timeout).
Anyway, they should not become stuck; if you can recreate it, that would be
great.

On Thursday, March 8, 2012 at 5:54 PM, Frederic wrote:

Hi Kimchy,

I've been doing some local tests regarding this using ES 0.19.0 and
found that, with a couple of threads, each with one transport client,
continuously indexing data, if I shut down ES I see the expected
NoNodeAvailableExceptions, but when I start it again the exceptions stop
(as if the server was detected), yet the clients hang and don't index any
more data.

If I restart the client, it works again. Is it possible that, once you
create a client, it stays bound to the current node name and cannot
operate with a new one?

I'll try to recreate the scenario in a simpler test, but now that I read
this post, I realize I've seen this behavior at different times across
different ES versions in our cluster, and we do have a similar backup
process that could affect master response (I always thought it was some
kind of network problem).

Thanks,

Frederic

On Tuesday, 14 February 2012 11:13:29 UTC-3, kimchy wrote:

They should restore the connection, yes. Can you easily repro this? If
so, can you try it with 0.19.RC2? There have been some enhancements there
for the mentioned scenario.

On Monday, February 13, 2012 at 9:08 PM, Lar Mader wrote:

We've been having a recurring problem in which our server and client
nodes get disconnected, seem to rejoin successfully, but are no longer
communicating with each other. Our setup is pretty simple - we have one
server node, and one client connecting as a node client with data set to
false.

Here is the only thing that seems relevant in the elasticsearch server log:
[2012-02-13 00:21:00,231][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [61382ms] ago, timed out [31381ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203336]
[2012-02-13 00:21:00,232][WARN ][transport ] [Ammo] Received response for a request that has timed out, sent [31383ms] ago, timed out [1383ms] ago, action [discovery/zen/fd/ping], node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}], id [203337]
[2012-02-13 01:55:08,698][WARN ][discovery.zen ] [Ammo] received a join request for an existing node [[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false}]

And then our tomcat logs at the same time, where the node client is running, show this:

2012-02-13 01:55:36,226 [elasticsearch[cached]-pool-20-thread-8] INFO - [Gideon Mace] master_left [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] WARN - [Gideon Mace] master_left and no other node elected to become master, current nodes: {[Gideon Mace][cLwsTmSMQDeeAsW0EzJRsQ][inet[/10.205.4.76:9300]]{client=true, data=false},}
2012-02-13 01:55:36,228 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] removed {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-master_failed ([Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]])
2012-02-13 01:55:39,242 [elasticsearch[Gideon Mace]clusterService#updateTask-pool-30-thread-1] INFO - [Gideon Mace] detected_master [Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]], added {[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]],}, reason: zen-disco-receive(from master [[Ammo][KRS6NIEsRTGT_YtCYvnYHw][inet[/10.205.4.78:9300]]])

So even though it looks like the node client rejoined, they aren't
talking to each other. I have to restart the elasticsearch server to get
them talking again. One more thing: we have a backup process that runs on
these servers and causes some load on the system, which is what is causing
the master to drop out. But it seems like they should be able to
reconnect successfully afterwards.

Any ideas?
Thanks,
Lar


(Frederic) #9

I have run this: https://gist.github.com/2057402 without a server to
connect to (just creating a transport client and adding a localhost
endpoint with no server behind it), and it's fine; it simply bails because
it fails to open a socket against it (you can see it in the logging if
you turn it up).

I see, I'll increase the log level (currently INFO) in my app to see
if something arises.

In 0.19, we added another layer of timeout (defaults to 5 seconds) for the
sampling intervals, which gives another layer of "protection"
from misbehaving servers. But if the server is not even running, you
basically can't open a socket against it...

That's for sure, but what looked strange to me was that, with some nodes
running fine (green status) but one down, the client didn't start because
that node was in the host list. I needed to remove that server from the
hosts list for the client to start.

Never mind; based on your code and results, I guess it may be something on my
side.
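
In case it's useful to others, this is the kind of logging change I mean. A sketch of a config/logging.yml fragment, assuming the stock log4j-based layout ES ships with; the category names (relative to org.elasticsearch) are my assumption and worth double-checking:

```yaml
# config/logging.yml fragment -- sketch only, category names assumed
logger:
  transport: DEBUG          # connection setup/teardown, timed-out requests
  discovery: DEBUG          # zen ping and master fault detection
  client.transport: TRACE   # transport client connect attempts and node sampling
```

On the client side, the equivalent log4j categories can be raised in the application's own logging configuration instead.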

Thanks,

On Saturday, 17 March 2012 07:32:51 UTC-3, kimchy wrote:

I have run this: https://gist.github.com/2057402 without a server to
connect to (just creating a transport client and adding a localhost
endpoint with no server behind it), and it's fine; it simply bails because
it fails to open a socket against it (you can see it in the logging if you
turn it up).

In 0.19, we added another layer of timeout (defaults to 5 seconds) for the
sampling intervals, which gives another layer of "protection" from
misbehaving servers. But if the server is not even running, you basically
can't open a socket against it...



(system) #10