We had a node out of a 7.6 cluster for a day. After starting Elasticsearch on it and letting shards be allocated to it, the node and the cluster started behaving badly. The node's CPU rose to roughly twice that of the cluster's other data nodes, its search thread pool appeared maxed out at 46 active threads, and it had a very large search queue (~300). The cluster's average response time rose 8x.
We turned off adaptive replica selection (ARS) and both the node and the cluster returned to normal.
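For reference, this is roughly how we toggle it. A minimal Python sketch, assuming an unsecured cluster reachable at localhost:9200; the setting name `cluster.routing.use_adaptive_replica_selection` is the documented cluster setting, but the host, lack of auth, and use of a persistent setting are assumptions to adjust for your environment.

```python
# Toggle adaptive replica selection via the cluster settings API.
# Assumes an unsecured cluster at localhost:9200.
import requests

ES = "http://localhost:9200"

def set_ars(enabled: bool) -> dict:
    """Enable or disable adaptive replica selection as a persistent setting."""
    body = {"persistent": {"cluster.routing.use_adaptive_replica_selection": enabled}}
    resp = requests.put(f"{ES}/_cluster/settings", json=body)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(set_ars(False))  # turn ARS off, as we did during the incident
```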
We saw basically the same behavior in this cluster on two other occasions without any node removal/addition, while the cluster was under high load. Turning off ARS resolved the matter. Turning ARS back on afterwards did not cause a return of the behavior.
So I'm thinking that ARS might get into a bad state sometimes.
I'm curious whether Elastic is interested in further details or has any recommendations.
I didn't capture hot threads at the time, but I will during the next incident.
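In case it's useful, here's the sort of capture script I plan to run next time. A rough Python sketch, assuming an unsecured cluster at localhost:9200 and a placeholder node name; it just hits the nodes hot threads API a few times and saves the plain-text output.

```python
# Capture several hot-threads samples from one node during an incident.
# Host and node name are placeholders; the endpoint returns plain text.
import requests
import time

ES = "http://localhost:9200"
NODE = "node-3366"  # hypothetical node name; substitute the distressed node

def capture_hot_threads(node: str, samples: int = 5, interval_s: float = 10.0) -> None:
    for i in range(samples):
        resp = requests.get(
            f"{ES}/_nodes/{node}/hot_threads",
            params={"threads": 10, "ignore_idle_threads": "true"},
        )
        resp.raise_for_status()
        with open(f"hot_threads_{node}_{i}.txt", "w") as f:
            f.write(resp.text)
        time.sleep(interval_s)

if __name__ == "__main__":
    capture_hot_threads(NODE)
```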
Now that we're watching active and queued searches per node more closely, we're seeing smaller, erratic instances of the same kind of problem during the cluster's heaviest load period. This is with ARS turned off, so it looks like the problem might not be caused by ARS.
Here are the search thread stats for this cluster today: active searches occasionally max out on the one node, and queued searches then pop up, of course.
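For context, this is roughly how we're polling it. A quick Python sketch, assuming an unsecured cluster at localhost:9200; it reads the `_cat/thread_pool/search` endpoint and prints nodes sorted by queue depth.

```python
# Poll per-node search thread pool activity and print the busiest nodes first.
# Assumes an unsecured cluster at localhost:9200; stop with Ctrl-C.
import requests
import time

ES = "http://localhost:9200"

def poll_search_threads() -> None:
    params = {"format": "json", "h": "node_name,active,queue,rejected"}
    rows = requests.get(f"{ES}/_cat/thread_pool/search", params=params).json()
    for row in sorted(rows, key=lambda r: int(r["queue"]), reverse=True):
        print(f"{row['node_name']:>20}  active={row['active']:>3}  "
              f"queue={row['queue']:>4}  rejected={row['rejected']}")

if __name__ == "__main__":
    while True:
        poll_search_threads()
        print("---")
        time.sleep(30)
```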
The common factor here seems to be just the node, maybe the particular shards on it. I'll look at that more closely. The fact that it spirals out of control somewhat suddenly -- it's kind of bimodal -- suggests a weird tipping point. Should ARS at least be able to shunt searches away from a badly distressed node?
Had another incident today. Errors started rising, we made a number of adjustments, and the cluster settled down. Once it had settled down I turned ARS back on and the cluster started behaving badly again.
At 14:44 ARS was turned on; at 14:58 it was turned off.
I also grabbed node stats at 14:53 and dug through them to pull the ARS data from each coordinating node. For each coordinating node I only looked at ARS info for six data nodes: two that went high-CPU (3366 and 3369), two that stayed in the middle (1863 and 3120), and two from the group that went lower (3119 and 3130). Null values indicate there was no entry for that target node in the coordinating node's adaptive_selection object. (For example, the node stats for coordinating node 2954 show null ARS information for data nodes 3366, 3369, and 3119.) (What's the term I should be using for "target node" here?)
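For anyone curious how I pulled that out, here's a rough Python sketch of the approach. The host, the node-name fragments, and the exact fields printed are assumptions for illustration, not taken from the incident data; the `adaptive_selection` section of node stats keys its entries by target node id, with fields such as `rank`.

```python
# For each coordinating node, look up its adaptive_selection entry for a
# handful of target data nodes; a missing entry is reported as None.
# Assumes an unsecured cluster at localhost:9200; node-name fragments are hypothetical.
import requests

ES = "http://localhost:9200"
TARGET_FRAGMENTS = ["3366", "3369", "1863", "3120", "3119", "3130"]

stats = requests.get(f"{ES}/_nodes/stats/adaptive_selection").json()

# Map node id -> node name for readable output.
names = {nid: n["name"] for nid, n in stats["nodes"].items()}

for coord_id, node in stats["nodes"].items():
    ars = node.get("adaptive_selection", {})
    print(f"coordinating node {names[coord_id]}:")
    for target_id, target_name in names.items():
        if any(frag in target_name for frag in TARGET_FRAGMENTS):
            entry = ars.get(target_id)  # None means no ARS entry for this target
            rank = entry.get("rank") if entry else None
            print(f"  target {target_name}: rank={rank}")
```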