Search distribution imbalance: Adaptive Replica Selection

rsk0 · September 2, 2020, 6:19pm

We had a node out of a 7.6 cluster for a day. After we started ES on it and shards were allocated to it, the node and the cluster started behaving badly. The node's CPU rose to roughly twice that of the cluster's other data nodes, its search thread pool was apparently maxed out at 46 threads, and it had a very large search queue (~300). The cluster's average response time rose 8x.

We turned off ARS and the cluster and node returned to performing normally.
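
The post does not say how ARS was switched off. Here is a minimal sketch, assuming it was done through the dynamic cluster setting `cluster.routing.use_adaptive_replica_selection` and that the cluster answers on `http://localhost:9200` without authentication. The URL and the function names are illustrative assumptions, not anything from the thread.

```python
import json
import urllib.request

# Assumption: cluster reachable here, no auth or TLS.
ES_URL = "http://localhost:9200"


def ars_settings_body(enabled: bool) -> dict:
    """Build a cluster-settings body that switches ARS on or off."""
    return {
        "transient": {
            "cluster.routing.use_adaptive_replica_selection": enabled
        }
    }


def build_ars_request(enabled: bool, base_url: str = ES_URL) -> urllib.request.Request:
    """Prepare (but do not send) the PUT _cluster/settings request."""
    data = json.dumps(ars_settings_body(enabled)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_ars(enabled: bool, base_url: str = ES_URL) -> dict:
    """Send the request and return the cluster's parsed JSON response."""
    with urllib.request.urlopen(build_ars_request(enabled, base_url)) as resp:
        return json.load(resp)
```

Calling `set_ars(False)` would turn ARS off and `set_ars(True)` would turn it back on. A transient setting is used here so the change does not survive a full cluster restart.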

We saw basically the same behavior in this cluster on two other occasions without any node removal/addition, while the cluster was under high load. Turning off ARS resolved the matter. Turning ARS back on afterwards did not cause a return of the behavior.

So I'm thinking that ARS might get into a bad state sometimes.

I'm curious if Elastic are interested in further details or have recommendations.

warkolm · September 2, 2020, 8:50pm

Did you check hot threads on the node while this was happening? Do you have logs from that time?

rsk0 · September 3, 2020, 1:19am

I didn't get hot threads at the time, but I will at the next incident.
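
For capturing hot threads quickly during an incident, a small helper along these lines might do. It is a sketch only: it assumes an unauthenticated cluster URL, and the function names and output file naming are hypothetical.

```python
import os
import time
import urllib.parse
import urllib.request


def hot_threads_url(base_url, node=None, threads=3, interval="500ms"):
    """Build the nodes hot threads URL, optionally for a single node."""
    path = f"/_nodes/{node}/hot_threads" if node else "/_nodes/hot_threads"
    query = urllib.parse.urlencode({"threads": threads, "interval": interval})
    return f"{base_url}{path}?{query}"


def capture_hot_threads(base_url, out_dir=".", node=None):
    """Fetch hot threads and save them to a timestamped text file."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    out_path = os.path.join(out_dir, f"hot_threads_{stamp}.txt")
    with urllib.request.urlopen(hot_threads_url(base_url, node)) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())
    return out_path
```

Running `capture_hot_threads("http://localhost:9200", node="<node name>")` a few times during the bad period would give several snapshots to compare.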

Now that we're watching active and queued searches per node more closely, we're seeing the same kind of problem in small amounts erratically during the cluster's heaviest load period. This is with ARS turned off. Looks like the problem might not be caused by ARS.
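
One way to watch active and queued searches per node is to poll `GET _cat/thread_pool/search?format=json&h=node_name,active,queue,rejected` and flag the nodes that are saturated or queueing. The sketch below only covers the flagging step and takes the parsed JSON rows as input. The function name is hypothetical, and it is not how the graphs in this thread were produced.

```python
def flag_busy_nodes(rows, pool_size):
    """Return nodes whose search pool is full or has queued searches.

    `rows` is the parsed JSON list from the _cat/thread_pool call above;
    with format=json the numeric columns arrive as strings.
    """
    flagged = []
    for row in rows:
        active = int(row["active"])
        queued = int(row["queue"])
        if active >= pool_size or queued > 0:
            flagged.append(
                {"node": row["node_name"], "active": active, "queue": queued}
            )
    # Worst queue first, so the distressed node is at the top.
    return sorted(flagged, key=lambda r: r["queue"], reverse=True)
```

With `pool_size=46` (the thread count seen in the first incident), a node reporting 46 active searches or any queued searches would be listed.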

Here are the search threads for this cluster today. Active searches occasionally max out for the one node, and queued searches then pop up, of course:

(lines are active searches, dots are queued)

Here's what one of the more serious failures looked like (this is with ARS enabled until the end of the failure period):

The common factor here seems to be just the node, maybe the particular shards on it. I'll look at that more closely. The fact that it spirals out of control somewhat suddenly -- it's kind of bimodal -- suggests a weird tipping point. Should ARS at least be able to shunt searches away from a badly distressed node?

rsk0 · September 3, 2020, 4:30pm

If I have a 22K hot threads output I want to share, is the convention here to paste it directly or use a pastebin site?
Here's the paste.

rsk0 · September 5, 2020, 11:23pm

Had another incident today. Errors started rising, we made a number of adjustments, and the cluster settled down. While it was settled down I turned on ARS, and the cluster started behaving badly.

At 2:44 ARS was turned on. At 2:58 ARS was turned off.

Total cluster CPU dropped during ARS.

Individual CPUs splayed out. Two nodes got really bad, two elevated slightly, and the rest dropped.

Response time shot up during ARS.

ARS was somehow involved in the cluster going bad.

Here's a copy of hot threads for the cluster during ARS.

I also grabbed node stats at 14:53, and dug around to get ARS data from each coordinating node. For each coordinating node I only grabbed ARS info for six nodes: two that went high CPU (nodes 3366 and 3369), two that stayed in the middle (1863, 3120), and two from the group that went lower (3119, 3130). The null values indicate there wasn't an entry for these target nodes in the adaptive_selection object of the coordinating nodes. (For example, node stats for coordinating node 2954 show that there is null ARS information for data nodes 3366, 3369, and 3119.) (What's the term I should be using for "target node" here?)

ARS info from the coordinating nodes, sorted by target node, shows:

high CPU target nodes: 1 null, 6 empty objects
middling CPU target nodes: 1 empty, 6 ARS detail data objects
low CPU target nodes: 7 null or 7 ARS detail data objects

Maybe the ranking gets confused when eligible nodes' information is a combination of missing, empty, and filled out?
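
The null / empty / detail tally above can be reproduced from a node stats response (for example `GET _nodes/stats/adaptive_selection`) with something like the sketch below. It assumes the usual response shape, where each coordinating node has an `adaptive_selection` object keyed by target node ID. The function name is hypothetical, and the short node numbers used in this post would first need mapping to node IDs.

```python
def classify_ars_entries(node_stats, target_ids):
    """Count, per target node, how coordinating nodes report its ARS data.

    'null'   : the coordinating node has no entry for the target
    'empty'  : the entry exists but is an empty object
    'detail' : the entry carries ARS fields (rank, queue size, ...)
    """
    summary = {t: {"null": 0, "empty": 0, "detail": 0} for t in target_ids}
    for stats in node_stats.get("nodes", {}).values():
        ars = stats.get("adaptive_selection") or {}
        for target in target_ids:
            entry = ars.get(target)
            if entry is None:
                summary[target]["null"] += 1
            elif not entry:
                summary[target]["empty"] += 1
            else:
                summary[target]["detail"] += 1
    return summary
```

Running it on node stats captured during and after an incident would show whether the mix of missing, empty, and filled entries lines up with the nodes that go high CPU.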

rsk0 · September 10, 2020, 7:04pm

@warkolm, I wasn't able to figure out anything from the hot threads, how about you?

It looks like ARS is causing imbalance problems.

Anyone else having problems with ARS?

rsk0 · September 14, 2020, 5:40pm

It looks pretty clear to me that ARS caused problems for this cluster. Would Elastic like me to file a bug or follow up in some other way?

warkolm · September 14, 2020, 9:38pm

I think that'd be a good idea.

system · October 12, 2020, 9:38pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
