Investigating a "stuck" node

I'm setting up a new search application on ES 7.3.1 and I am seeing some unusual behavior. I have 16 storage nodes and 5 coordinator nodes (no storage). All queries go though the coordinators. We have 400K docs, 4TB total storage across the primary and 2 replicas, 750 shards.

When we see load around 8 queries per second something strange starts to happen. Across a span of about 7 hours we see the latency slowly creep up from 200ms to 1.2s and we will start queuing and then rejecting queries. Then suddenly (across a minute or 2) the latency will drop back down to 200ms and start again slowly ramping up.

Digging in a little deeper we see from _nodes/stats that the time spent in query is low across all except for one node and that node seems to be maxed out at ~49seconds per wall clock second (corresponding to the number of search threads). Similarly if you look at the number of queries in progress, most nodes are 0ish but the busy node has 20 or 30 queries in progress (and the number changes frequently).

The most mysterious part is that the CPU, memory, network looks about the same on the stuck node versus a healthy node. So it's as if a single node just didn't feel like working. I would say that maybe we have a bad node, but the thing is, the node that shows this behavior swaps to another node pretty regularly.

Other information, for the queries we're receiving right now, we use no routing on search side, so all queries go to all nodes.

Are you using the Monitoring functionality?

We are monitoring with Data Dog and I'm also rummaging around through _stats, _note/stats, _tasks, and query profiles.

What is happening in the logs of the node?
Can you check hot threads?

If CPU the same, then would suspect slow / bad IO - what is underlying infrastructure like AWS, VMWare, etc. and can you replace this node or its disks?

IO would be easily visible in load average and iostat, etc. as slow disks, queue, etc. Datadog should show this also for Linux.

Datadog showing any GC issue or variance with other nodes, latency on that node, etc.?

Assume affects indexing, too? That would be more obvious in stats on any index with shards on that node as they have to complete before the bulk index call returns. Might be hard to get and with only 400K docs (seems 3M/doc; not small), maybe you not indexing much.

You don't mention master nodes; but I assume that node is not a master.

@warkolm normal logs are silent (I'm not sure what the log level is, but there are not ERROR lines and only 1 or so INFO lines per day). Slowlogs has some traffic, but it's not a lot really even though we are clearly seeing this one node take lots more time than the other nodes. Also, all the slow queries look just like all the other queries. I ran hot threads across the cluster several times, and the "stuck" node seems to have the same behavior as all the other nodes. Mostly I see management threads. Very rarely will I catch a search thread.

@steve_mushero I looked at _node/stats for the stuck node and for a healthy node and the differences seem to be symptoms (rather than underlying causes) or are irrelevant:

  • search.open_contexts: stuck=270, healthy=0
  • search.query_time_in_millis: stuck=5.5B, healthy=2.2B (this is across the lifetime, so it's indicative that the stuck node has been stuck for a while)
  • search.fetch_total: stuck=7.8M, healthy=40M
  • merges.total_docs: stuck=138M, healthy=77.7M
  • segments.index_writer_memory_in_bytes: stuck=0, healthy=1.4M
  • docs.deleted: stuck=20M, healthy=6.8M
  • process.{open_file_descriptors,cpu.percent,mem}: about the same
  • jvm.mem: about the same
  • threads.search.active: stuck=49, healthy=0 - lots more in queue and rejected
  • fh/data/io_stats/transport/http - almost all the same

We upgraded to 7.9 and the stuck node problem hasn't popped up again. Was a bug fixed between 7.3.1 and 7.9.1?