Cluster hangs for 20 seconds on a single node crash


I have a simple setup with test data of 10 documents.
I have 3 nodes: 2 data nodes and 1 master-only node.
I have 5 shards with 1 replica.

I run a search query every second via a small simulator.
I then disable the network card on the node that contains only the replicas.
All of my search queries lag during the first 20 seconds after the card is disabled.

So the first call after the NIC goes down gets a reply after 19s,
the second call gets a reply after 18s,
the third call gets a reply after 17s.

I am using Elasticsearch 6.7.1. Can someone elaborate on the root cause of this?

How come killing 1 node makes my cluster hang for 20 seconds?

Thanks in advance,

How have you configured /proc/sys/net/ipv4/tcp_retries2? It defaults to 15 which is far too many IMO, and there are others who recommend reducing it to 3 for high-availability situations.
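To get a feel for what that setting means in practice, here is a rough Python sketch that reads the value on Linux and estimates a hypothetical lower bound on how long the kernel keeps retransmitting before it gives up on a connection. It assumes the usual minimum and maximum retransmission timeouts of 200 ms and 120 s and ignores the measured round-trip time, so treat the numbers as an approximation. The function names are made up for illustration.

```python
from pathlib import Path

TCP_RETRIES2 = Path("/proc/sys/net/ipv4/tcp_retries2")


def read_tcp_retries2(path=TCP_RETRIES2):
    """Return the configured retry count, or None if the file is absent (e.g. not Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def estimated_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Approximate seconds before the kernel gives up on an unacknowledged segment.

    Assumption: the retransmission timeout starts at rto_min, doubles after
    every retransmission and is capped at rto_max. The connection is dropped
    when the timer after the last retransmission expires, hence retries + 1 terms.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    current = read_tcp_retries2()
    print("tcp_retries2 on this host:", current)
    for retries in (15, 3):
        print(f"retries={retries}: about {estimated_timeout(retries):.1f}s")
```

Under these assumptions, 15 retries works out to roughly 924.6 seconds, whereas 3 retries is about 3 seconds.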

There's also an issue in older Elasticsearch versions (fixed in #39629, released in 7.2.0) that could slow down cluster state updates in your situation. I don't know that this will affect this experiment, unless you're disabling the NIC on the master, but I recommend upgrading to a later version.


Thanks! Will try to repeat the test on the latest version...
I didn't mention that I am testing this on Windows boxes.

Ah, ok, I think Windows has a similar kind of parameter to control TCP retries, but I don't know what it is.

Tested it on the latest, problem persists...

In general I don't understand the flow, regardless of TCP/Linux.
Node A contains the primary shards.
Node B contains the replica shards, and its NIC is disabled.

Why is Node A hanging for 20 seconds when I am searching data contained on its shards?!

Normally a search will be distributed across the whole cluster, so I would expect it to try and search some of the shards on node B. If your OS is configured to retry transmission an unreasonable number of times before giving up then those remote searches could take a long time to fail.

I understand ES round-robins between the nodes, but I make a call every second, and all of the calls hang during this time.

Even if ES is distributing my search, the local shard should reply and I expect to get the reply back.
I ran this test with only one document in my index, on the latest code, and the issue still occurs.

Please note that the test disables the NIC. If I kill the service instead, everything works perfectly without this hang...

I think you misunderstand. Each search is distributed across the cluster, and is expected to involve the disconnected node.

I don't understand the logic in this design:
ES sends my search query to all nodes. Let's say I have 5 nodes, where one of them has crashed.
Now I am getting replies from 4 nodes, but instead of returning the results, the server will wait for the reply from node #5, which is down?

Right. The much more common case is that you don't have a failing node and there you want each search to use all the CPU/IO/etc. resources in the cluster, rather than restricting itself to a single node.

Elasticsearch will notice that the remote node is down as soon as the OS tells it the connection has dropped. The issue you're facing is that the OS is taking far longer than you would like to notice that the connection has dropped.

:) Thanks for being patient...
Sending to several nodes makes sense, no problem.

But why should the server wait for all nodes to reply? Why not return the reply once one of the nodes has replied?

BTW, can you try to direct me to the code that is in charge of this part of the flow?

It only searches one copy of each shard, so it needs to collect all the responses (or failures) before it can respond.
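As a toy model of that (not the actual Elasticsearch code, and with made-up names), the sketch below treats a search as one request per shard, each sent to the node holding the chosen copy. The coordinating node can only answer once every shard has either responded or failed, so the response time is the maximum over the shards. It assumes that requests to the disconnected node only fail once the OS reports the dropped connection, here taken to be about 20 seconds after the NIC went down.

```python
def search_response_time(issued_at, chosen_nodes, dead_node,
                         failure_detected_at, healthy_latency=0.01):
    """Seconds until a search issued `issued_at` seconds after the outage can be answered.

    chosen_nodes: the node holding the copy selected for each shard (one entry per shard).
    failure_detected_at: hypothetical time at which the OS reports the connection as dropped.
    """
    per_shard = []
    for node in chosen_nodes:
        if node == dead_node and issued_at < failure_detected_at:
            # The request is stuck until the OS gives up on the connection.
            per_shard.append(failure_detected_at - issued_at)
        else:
            # Healthy node, or the failure is already known so the dead node is skipped.
            per_shard.append(healthy_latency)
    # All shard responses (or failures) are needed before replying.
    return max(per_shard)


if __name__ == "__main__":
    chosen = ["A", "B", "A", "B", "A"]  # 5 shards, some copies picked on node B
    for t in (1, 2, 3, 21):
        print(t, search_response_time(t, chosen, "B", failure_detected_at=20))
```

With these assumptions, searches issued 1, 2 and 3 seconds after the outage take 19, 18 and 17 seconds, which is the pattern reported at the top of the thread, and a search issued after the failure has been detected returns at normal speed.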

It's hard to point at any one place that implements all this behaviour (it's actually quite complicated) but maybe this is a useful starting point?
