Cluster hangs for 20 seconds on a single node crash

ran_n · September 5, 2019, 7:53am

Hi,

I have a simple setup with test data of 10 documents.
I have 3 nodes: 2 data nodes and 1 master-only node.
I have 5 shards with 1 replica.

I run a search query every second via a small simulator.
I then disable the network card on the node that contains only the replicas.
All of my search queries lag during the first 20 seconds after the card is disabled.

So the first call after the NIC goes down gets a reply after 19s.
The second call gets a reply after 18s.
The third call gets a reply after 17s.

I am using Elasticsearch 6.7.1. Can someone elaborate on the root cause of this?

How come killing 1 node makes my cluster hang for 20 seconds?

Thanks in advance,
Ran

DavidTurner · September 5, 2019, 8:23am

How have you configured /proc/sys/net/ipv4/tcp_retries2? It defaults to 15 which is far too many IMO, and there are others who recommend reducing it to 3 for high-availability situations.
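
To make the effect of this setting concrete, here is a rough sketch of how the retry count translates into a give-up time on Linux. It assumes the retransmission timeout starts at the kernel minimum of 200 ms, doubles after every unanswered retransmission, and is capped at 120 s. The function name and its parameters are made up for this illustration, and the result is only a hypothetical lower bound, since the real starting timeout depends on the measured round-trip time. It says nothing about Windows.

```python
def hypothetical_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Approximate seconds until Linux gives up on an unacknowledged
    TCP segment, for a given value of tcp_retries2.

    Assumption: the retransmission timeout starts at rto_min, doubles
    after each unanswered retransmission, and is capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # One wait before the first retransmission, plus one wait after
    # each of the `retries` retransmissions.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


for retries in (15, 5, 3):
    print(f"tcp_retries2={retries}: ~{hypothetical_timeout(retries):.1f}s")
```

Under these assumptions the default of 15 works out to about 924.6 seconds, while 3 works out to about 3 seconds, which is why the default is so painful when a peer silently disappears.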

There's also an issue in older Elasticsearch versions (fixed in #39629, released in 7.2.0) that could slow down cluster state updates in your situation. I don't know that this will affect this experiment, unless you're disabling the NIC on the master, but I recommend upgrading to a later version.

ran_n · September 5, 2019, 8:39am

Thanks! I will try to repeat the test on the latest version...
I didn't mention that I am testing this on Windows boxes.

DavidTurner · September 5, 2019, 8:41am

Ah, ok, I think Windows has a similar kind of parameter to control TCP retries, but I don't know what it is.

ran_n · September 5, 2019, 11:15am

Tested it on the latest version; the problem persists...

ran_n · September 5, 2019, 11:21am

In general I don't understand the flow, regardless of TCP/Linux.
Node A contains the primary shard.
Node B contains the replica shard, and its NIC is disabled.

Why is Node A hanging for 20 seconds when I am searching data contained in its own shards?!

DavidTurner · September 5, 2019, 11:27am

Normally a search will be distributed across the whole cluster, so I would expect it to try and search some of the shards on node B. If your OS is configured to retry transmission an unreasonable number of times before giving up then those remote searches could take a long time to fail.

ran_n · September 5, 2019, 11:39am

I understand that ES round-robins between the nodes, but I make a call every second, and all of the calls hang during this time.

Even if ES distributes my search, the local shard should reply, and I expect to get that reply back.
I ran this test with only one document in my index, on the latest code, and the issue still occurs.

Please note that the test disables the NIC; if I kill the service instead, everything works perfectly without this hang...

DavidTurner · September 5, 2019, 11:47am

I think you misunderstand. Each search is distributed across the cluster, and is expected to involve the disconnected node.

ran_n · September 5, 2019, 12:13pm

I don't understand the logic of this design:
ES sends my search query to all nodes. Let's say I have 5 nodes, and one of them has crashed.
Now I am getting replies from 4 nodes, but instead of returning the results, the server will wait for the reply from node #5, which is down?

DavidTurner · September 5, 2019, 12:22pm

Right. The much more common case is that you don't have a failing node and there you want each search to use all the CPU/IO/etc. resources in the cluster, rather than restricting itself to a single node.

Elasticsearch will notice that the remote node is down as soon as the OS tells it the connection has dropped. The issue you're facing is that the OS is taking far longer than you would like to notice that the connection has dropped.
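
This would also explain the 19s / 18s / 17s pattern reported at the top of the thread. A toy model, assuming the OS reports the dropped connection a fixed 20 seconds after the NIC goes down (the figure is taken from the original post) and that every search issued before then is waiting on the disconnected node. The function and parameter names are invented for this illustration.

```python
def observed_latency(seconds_since_failure, detection_delay=20.0):
    """Seconds a search waits if it is stuck on the dead node until
    the OS finally reports the connection as dropped.

    Assumption: detection happens exactly `detection_delay` seconds
    after the failure, and the rest of the search takes no time.
    """
    remaining = detection_delay - seconds_since_failure
    return max(remaining, 0.0)


# Searches issued 1, 2 and 3 seconds after the NIC was disabled.
for t in (1, 2, 3):
    print(f"search issued at t={t}s waits {observed_latency(t):.0f}s")
```

In this model every pending search completes at the same moment, when the connection failure is finally reported, so a search started later simply waits for less time.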

ran_n · September 5, 2019, 12:27pm

Thanks for being patient...
Sending to several nodes makes sense, np.

But why should the server wait for all nodes to reply? Why not return the reply as soon as one of the nodes has replied?

BTW, can you point me to the code that is in charge of this part of the flow?

DavidTurner · September 5, 2019, 12:36pm

It only searches one copy of each shard, so it needs to collect all the responses (or failures) before it can respond.
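
As a very simplified illustration of that behaviour (this is not the Elasticsearch implementation, just a sketch of the scatter-gather pattern with made-up names and delays), the coordinating step below queries one copy of each shard concurrently and can only answer once every shard has either replied or failed:

```python
import asyncio
import time


async def query_shard(shard_id, delay, fails=False):
    """Stand-in for a per-shard search request (hypothetical)."""
    await asyncio.sleep(delay)
    if fails:
        raise ConnectionError(f"shard {shard_id}: connection dropped")
    return f"hits from shard {shard_id}"


async def search(shard_plan):
    """shard_plan maps shard id -> (delay in seconds, fails)."""
    requests = [query_shard(sid, d, f) for sid, (d, f) in shard_plan.items()]
    started = time.monotonic()
    # Wait for every shard to respond or fail before building the reply.
    outcomes = await asyncio.gather(*requests, return_exceptions=True)
    elapsed = time.monotonic() - started
    hits = [o for o in outcomes if not isinstance(o, Exception)]
    failures = [o for o in outcomes if isinstance(o, Exception)]
    return hits, failures, elapsed


def run_demo():
    # Shards 0 and 1 answer quickly; shard 2 sits on the unreachable
    # node and only fails once the (simulated) OS timeout expires.
    plan = {0: (0.01, False), 1: (0.01, False), 2: (0.2, True)}
    return asyncio.run(search(plan))


if __name__ == "__main__":
    hits, failures, elapsed = run_demo()
    print(f"{len(hits)} shard replies, {len(failures)} failure, {elapsed:.2f}s")
```

The two fast shards finish almost immediately, but the overall response time is set by the slowest shard, which here is the one whose failure takes longest to surface.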

It's hard to point at any one place that implements all this behaviour (it's actually quite complicated) but maybe this is a useful starting point?

https://github.com/elastic/elasticsearch/blob/a1027881972afb07f262a4bf07447e7f0b6e1b28/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java

system · October 3, 2019, 12:45pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
