I'm setting up a new search application on ES 7.3.1 and I'm seeing some unusual behavior. I have 16 data nodes and 5 coordinating-only nodes (no data). All queries go through the coordinators. We have 400K docs, 4TB total storage across the primary and 2 replicas, and 750 shards.
When load reaches around 8 queries per second, something strange starts to happen. Over a span of about 7 hours, latency slowly creeps up from 200ms to 1.2s, and we start queuing and then rejecting queries. Then suddenly (over a minute or two) latency drops back down to 200ms and the slow ramp-up begins again.
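In case it's useful, this is roughly the view we use to watch the queuing and rejections build up per node (the exact column list is just what we happen to look at):

```
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed
```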
Digging a little deeper, _nodes/stats shows that time spent in query is low on every node except one, and that node appears maxed out at ~49 seconds of query time per wall-clock second (corresponding to the number of search threads). Similarly, looking at the number of queries in progress, most nodes sit around 0 while the busy node has 20 or 30 queries in progress (and the number changes frequently).
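Concretely, those numbers come from something like the call below; query_time_in_millis is cumulative, so we sample it twice and divide the delta by the sample interval, and query_current is the in-flight count:

```
GET _nodes/stats/indices/search?filter_path=nodes.*.name,nodes.*.indices.search.query_current,nodes.*.indices.search.query_time_in_millis
```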
The most mysterious part is that CPU, memory, and network look about the same on the stuck node as on a healthy node, so it's as if a single node just didn't feel like working. I'd say we have a bad node, except that which node shows this behavior changes pretty regularly.
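For the resource comparison, this is roughly what we put side by side (network we pull from OS-level metrics rather than from ES):

```
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent,ram.percent
```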
One other piece of information: for the queries we're receiving right now we use no routing on the search side, so every query fans out to all shards and therefore to all data nodes.
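To make the no-routing point concrete, our searches go out without a routing parameter, like the first request below; a routed search for comparison would look like the second (the index name, query body, and routing value here are just placeholders):

```
# no routing: the search fans out to every shard of the index
GET our-index/_search
{
  "query": { "match": { "title": "example" } }
}

# with routing: only the shard(s) the routing value hashes to are searched
GET our-index/_search?routing=customer-123
{
  "query": { "match": { "title": "example" } }
}
```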