Search distribution imbalance: Adaptive Replica Selection

We had a node out of a 7.6 cluster for a day. After we started ES on it again and shards were allocated to it, both the node and the cluster started behaving badly. The node's CPU rose to roughly twice that of the cluster's other data nodes, its search thread pool was apparently maxed out at 46 active threads, and it had a very large search queue (~300). The cluster's average response time rose 8x.

We turned off ARS and the cluster and node returned to performing normally.
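For anyone following along: ARS is controlled by the dynamic cluster setting `cluster.routing.use_adaptive_replica_selection`, which can be changed with `PUT /_cluster/settings`. Below is a minimal sketch of the request body. The helper name `ars_settings_body` is made up for illustration, and the choice of `transient` versus `persistent` is an assumption, since it isn't stated above.

```python
import json


def ars_settings_body(enabled, scope="transient"):
    """Build the JSON body for PUT /_cluster/settings that toggles ARS.

    `scope` is an assumption: the thread does not say whether the change
    was applied as a transient or a persistent setting.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return {scope: {"cluster.routing.use_adaptive_replica_selection": bool(enabled)}}


# Body to send when turning ARS off.
print(json.dumps(ars_settings_body(False)))
```

Sending the same body with `True` turns ARS back on.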

We saw basically the same behavior in this cluster on two other occasions without any node removal/addition, while the cluster was under high load. Turning off ARS resolved the matter. Turning ARS back on afterwards did not cause a return of the behavior.

So I'm thinking that ARS might get into a bad state sometimes.

I'm curious if Elastic are interested in further details or have recommendations.

Did you check hot threads on the node while this was happening? Do you have logs from that time?

I didn't get hot threads at the time, but I will at the next incident.

Now that we're watching active and queued searches per node more closely, we're seeing the same kind of problem in small amounts erratically during the cluster's heaviest load period. This is with ARS turned off. Looks like the problem might not be caused by ARS.

Here are the search threads for this cluster today. Active searches occasionally max out on the one node, and queued searches then pop up, of course:

(lines are active searches, dots are queued)
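For reference, per-node active and queued search counts like these can be pulled from the cat thread pool API, e.g. `GET /_cat/thread_pool/search?format=json&h=node_name,active,queue,rejected`. Here is a rough sketch of flagging the nodes that are maxed out and queueing. The function name, the node names, and the sample numbers are illustrative only, and the pool size of 46 is just the value observed on this cluster.

```python
def saturated_nodes(rows, pool_size):
    """Return names of nodes whose search pool is fully busy AND queueing.

    `rows` is the parsed JSON list from the cat thread pool API; the cat
    APIs return numeric columns as strings, hence the int() conversions.
    """
    flagged = []
    for row in rows:
        active = int(row["active"])
        queued = int(row["queue"])
        if active >= pool_size and queued > 0:
            flagged.append(row["node_name"])
    return flagged


# Hypothetical sample rows; real node names and values will differ.
sample_rows = [
    {"node_name": "data-a", "active": "46", "queue": "300", "rejected": "0"},
    {"node_name": "data-b", "active": "12", "queue": "0", "rejected": "0"},
]
print(saturated_nodes(sample_rows, pool_size=46))
```

Polling this every few seconds and logging the flagged nodes would show whether it is always the same node that tips over.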

Here's what one of the more serious failures looked like (this is with ARS enabled until the end of the failure period):

The common factor here seems to be just the node, maybe the particular shards on it. I'll look at that more closely. The fact that it spirals out of control somewhat suddenly -- it's kind of bimodal -- suggests a weird tipping point. Should ARS at least be able to shunt searches away from a badly distressed node?

If I have a 22K hot threads output I want to share, is the convention here to paste it directly or use a pastebin site?
Here's the paste.

Had another incident today. Errors started rising, we made a number of adjustments, and the cluster settled down. Once it had settled, I turned ARS on and the cluster started behaving badly.

At 14:44 ARS was turned on. At 14:58 ARS was turned off.

Total cluster CPU dropped during ARS.

Individual CPUs splayed out. Two nodes got really bad, two elevated slightly, and the rest dropped.

Response time shot up during ARS.

ARS was somehow involved in the cluster going bad.

Here's a copy of hot threads for the cluster during ARS.

I also grabbed node stats at 14:53, and dug around to get ARS data from each coordinating node. For each coordinating node I only grabbed ARS info for six nodes: two that went high CPU (nodes 3366 and 3369), two that stayed in the middle (1863, 3120), and two from the group that went lower (3119, 3130). The null values indicate there wasn't an entry for these target nodes in the adaptive_selection object of the coordinating nodes. (For example, node stats for coordinating node 2954 show that there is null ARS information for data nodes 3366, 3369, and 3119.) (What's the term I should be using for "target node" here?)

ARS info from the coordinating nodes, grouped by target node, shows:

  • high CPU target nodes: 1 null, 6 empty objects
  • middling CPU target nodes: 1 empty, 6 ARS detail data objects
  • low CPU target nodes: 7 null or 7 ARS detail data objects
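In case it helps anyone repeat this tally, here is a minimal sketch of how the null / empty / detail counts can be produced from the `nodes` object returned by `GET /_nodes/stats/adaptive_selection`. The function names and the sample data are hypothetical. "null" here means the target node has no key at all in a coordinating node's `adaptive_selection` object, which is how the term is used above.

```python
from collections import Counter


def classify_ars_entry(coordinator_stats, target_id):
    """Classify one coordinating node's ARS entry for one target node."""
    selection = coordinator_stats.get("adaptive_selection") or {}
    if target_id not in selection:
        return "null"  # no entry for this target node at all
    # An entry exists: either an empty object or one with ARS details.
    return "detail" if selection[target_id] else "empty"


def tally_for_target(nodes_stats, target_id):
    """Count null/empty/detail entries for a target across all coordinators."""
    return Counter(
        classify_ars_entry(stats, target_id) for stats in nodes_stats.values()
    )


# Hypothetical sample: two coordinating nodes, two target nodes.
sample_nodes = {
    "coord-1": {"adaptive_selection": {"n1": {"rank": "1.2"}, "n2": {}}},
    "coord-2": {"adaptive_selection": {"n1": {"rank": "0.9"}}},
}
print(dict(tally_for_target(sample_nodes, "n2")))
```

Running the tally for each data node at a few points in time would show whether the high CPU nodes are consistently the ones with empty entries.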

Maybe the ranking gets confused when eligible nodes' information is a combination of missing, empty, and filled out?

@warkolm, I wasn't able to figure anything out from the hot threads. How about you?

It looks like ARS is causing imbalance problems.

Anyone else having problems with ARS?

It looks pretty clear to me that ARS caused problems for this cluster. Would Elastic like me to file a bug or follow up in some other way?

I think that'd be a good idea.