We had a node out of a 7.6 cluster for a day. After we started ES on it again and shards were allocated to it, both the node and the cluster started misbehaving. The node's CPU usage rose to roughly twice that of the cluster's other data nodes, its search thread pool was apparently maxed out at 46 threads, and its search queue grew very large (~300). The cluster's average response time rose 8x.
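As a side note on the 46 figure: the documented default size of the search thread pool is int((allocated processors * 3) / 2) + 1, so 46 threads is what a node with 30 allocated processors would get by default. The small sketch below only does that arithmetic; the function name is made up for illustration and the 30-processor value is an assumption, not a confirmed detail of the node.

```python
def default_search_pool_size(allocated_processors):
    """Default search thread pool size: int((allocated processors * 3) / 2) + 1."""
    return int((allocated_processors * 3) / 2) + 1


# 30 processors is an assumed value for illustration only.
print(default_search_pool_size(30))  # 46
```

If the pool size was not overridden, all 46 threads being active would mean the pool was fully saturated and further searches had to wait in the queue.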
We turned off ARS (adaptive replica selection), and both the node and the cluster returned to normal performance.
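For anyone wanting to try the same workaround, ARS is controlled by the dynamic cluster setting `cluster.routing.use_adaptive_replica_selection`, which can be changed through the cluster settings API. Below is a minimal sketch of that call. The helper names are made up for illustration, and the endpoint `http://localhost:9200` with no authentication is an assumption, not a description of our setup.

```python
import json
import urllib.request

# Dynamic cluster setting that controls adaptive replica selection (ARS).
ARS_SETTING = "cluster.routing.use_adaptive_replica_selection"


def build_ars_settings_body(enabled, persistent=False):
    """Build the JSON body for PUT /_cluster/settings that toggles ARS."""
    scope = "persistent" if persistent else "transient"
    return {scope: {ARS_SETTING: enabled}}


def build_ars_request(base_url, enabled, persistent=False):
    """Prepare (but do not send) the HTTP request that toggles ARS."""
    body = json.dumps(build_ars_settings_body(enabled, persistent)).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_ars(base_url, enabled, persistent=False, timeout=10):
    """Send the request to the cluster and return the parsed JSON response."""
    request = build_ars_request(base_url, enabled, persistent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Show what would be sent, without contacting any cluster.
# The URL is a hypothetical endpoint; replace it with a real node address.
request = build_ars_request("http://localhost:9200", enabled=False)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Calling `set_ars("http://localhost:9200", enabled=False)` would actually apply the change, and `enabled=True` turns ARS back on. The sketch defaults to a transient setting so the change does not survive a full cluster restart; pass `persistent=True` if it should.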
We saw essentially the same behavior in this cluster on two other occasions, both while the cluster was under high load and without any node removal or addition. Turning off ARS resolved the problem, and turning ARS back on afterwards did not bring the behavior back.
So I'm thinking that ARS might get into a bad state sometimes.
I'm curious if Elastic are interested in further details or have recommendations.