Investigating a "stuck" node

JnBrymn · September 16, 2020, 4:50pm

I'm setting up a new search application on ES 7.3.1 and I am seeing some unusual behavior. I have 16 storage nodes and 5 coordinator nodes (no storage). All queries go though the coordinators. We have 400K docs, 4TB total storage across the primary and 2 replicas, 750 shards.

When we see load around 8 queries per second something strange starts to happen. Across a span of about 7 hours we see the latency slowly creep up from 200ms to 1.2s and we will start queuing and then rejecting queries. Then suddenly (across a minute or 2) the latency will drop back down to 200ms and start again slowly ramping up.

Digging in a little deeper we see from _nodes/stats that the time spent in query is low across all except for one node and that node seems to be maxed out at ~49seconds per wall clock second (corresponding to the number of search threads). Similarly if you look at the number of queries in progress, most nodes are 0ish but the busy node has 20 or 30 queries in progress (and the number changes frequently).

The most mysterious part is that the CPU, memory, network looks about the same on the stuck node versus a healthy node. So it's as if a single node just didn't feel like working. I would say that maybe we have a bad node, but the thing is, the node that shows this behavior swaps to another node pretty regularly.

Other information, for the queries we're receiving right now, we use no routing on search side, so all queries go to all nodes.

warkolm · September 16, 2020, 9:48pm

Are you using the Monitoring functionality?

JnBrymn · September 16, 2020, 10:54pm

We are monitoring with Data Dog and I'm also rummaging around through _stats, _note/stats, _tasks, and query profiles.

warkolm · September 16, 2020, 11:06pm

What is happening in the logs of the node?
Can you check hot threads?

Steve_Mushero · September 17, 2020, 2:32am

If CPU the same, then would suspect slow / bad IO - what is underlying infrastructure like AWS, VMWare, etc. and can you replace this node or its disks?

IO would be easily visible in load average and iostat, etc. as slow disks, queue, etc. Datadog should show this also for Linux.

Datadog showing any GC issue or variance with other nodes, latency on that node, etc.?

Assume affects indexing, too? That would be more obvious in stats on any index with shards on that node as they have to complete before the bulk index call returns. Might be hard to get and with only 400K docs (seems 3M/doc; not small), maybe you not indexing much.

You don't mention master nodes; but I assume that node is not a master.

JnBrymn · September 17, 2020, 4:18pm

@warkolm normal logs are silent (I'm not sure what the log level is, but there are not ERROR lines and only 1 or so INFO lines per day). Slowlogs has some traffic, but it's not a lot really even though we are clearly seeing this one node take lots more time than the other nodes. Also, all the slow queries look just like all the other queries. I ran hot threads across the cluster several times, and the "stuck" node seems to have the same behavior as all the other nodes. Mostly I see management threads. Very rarely will I catch a search thread.

@steve_mushero I looked at _node/stats for the stuck node and for a healthy node and the differences seem to be symptoms (rather than underlying causes) or are irrelevant:

search.open_contexts: stuck=270, healthy=0
search.query_time_in_millis: stuck=5.5B, healthy=2.2B (this is across the lifetime, so it's indicative that the stuck node has been stuck for a while)
search.fetch_total: stuck=7.8M, healthy=40M
merges.total_docs: stuck=138M, healthy=77.7M
segments.index_writer_memory_in_bytes: stuck=0, healthy=1.4M
docs.deleted: stuck=20M, healthy=6.8M
process.{open_file_descriptors,cpu.percent,mem}: about the same
jvm.mem: about the same
threads.search.active: stuck=49, healthy=0 - lots more in queue and rejected
fh/data/io_stats/transport/http - almost all the same

JnBrymn · September 17, 2020, 10:44pm

We upgraded to 7.9 and the stuck node problem hasn't popped up again. Was a bug fixed between 7.3.1 and 7.9.1?

system · October 15, 2020, 10:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
1 Node gets stuck with high load and 0% disk idle Elasticsearch	3	387	November 4, 2019
Adding and removing a node fixed our latency issues Elasticsearch	3	554	January 4, 2022
Debugging performance decrease after a node fault Elasticsearch	4	634	February 3, 2018
Newbie performance troubleshooting, high load spikes on ES nodes Elasticsearch	5	5034	June 11, 2018
1 node in an elasticsearch cluster getting stuck for 15 minutes and then starts working Elasticsearch	3	92	June 27, 2024

Investigating a "stuck" node

Related Topics