I'm trying to figure out whether expensive queries are causing our search queues to back up, and I have some questions about how Elasticsearch counts threads.
When a search request fans out across multiple shards, does it use threads from each node's thread pool?
For example, if I have 10 nodes with 1 shard each (10 shards total) and I do a search request, would this search request be reflected as:
So IIUC: the "typical" case is to use 1 thread per node/shard, and those come out of that node's active thread pool.
Are the "non-typical" cases worth considering? Would there ever be more than 1 thread per shard? Does less than 1 thread per shard cover the cases where search requests are not hitting shards?
The general case is fairly complicated, and may differ between versions of Elasticsearch. For instance, searches are divided up into a number of phases, but not all phases run on all nodes. It's hard to know what might be salient to describe here in more detail. Perhaps it would be simpler for you to describe the problem you are investigating instead?
Yes, slow searches consume resources, which can cause other searches involving the same nodes to be enqueued. Heavy indexing also consumes CPU, which can slow down searches on those nodes.
I think I would start by obtaining the output of GET _nodes/hot_threads while the cluster is struggling, as this will give us a clearer picture of what it's busy doing. You could also look for correlations between the spikes and the output of GET _nodes/stats, but hot threads would be the first thing I'd look at.
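As a rough illustration of looking for correlations in the node stats, here is a minimal sketch that takes the parsed JSON body of GET _nodes/stats/thread_pool and lists the nodes whose search queue is non-empty. The function name, the threshold parameter, and the sample node names and numbers are all made up for illustration; only the nodes / thread_pool / search layout with its active, queue and rejected counters comes from the stats API.

```python
from typing import Any


def busy_search_pools(
    stats: dict[str, Any], queue_threshold: int = 1
) -> list[tuple[str, int, int, int]]:
    """Return (node_name, active, queue, rejected) for each node whose
    search thread pool queue is at or above queue_threshold,
    sorted with the longest queue first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("search")
        if pool is None:
            continue  # node reported no search pool in this response
        if pool.get("queue", 0) >= queue_threshold:
            rows.append(
                (
                    node.get("name", node_id),
                    pool.get("active", 0),
                    pool.get("queue", 0),
                    pool.get("rejected", 0),
                )
            )
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical, trimmed-down response body; real responses carry more fields.
SAMPLE = {
    "nodes": {
        "id-a": {"name": "es-1", "thread_pool": {"search": {"active": 13, "queue": 12, "rejected": 0}}},
        "id-b": {"name": "es-2", "thread_pool": {"search": {"active": 2, "queue": 0, "rejected": 0}}},
        "id-c": {"name": "es-3", "thread_pool": {"search": {"active": 13, "queue": 40, "rejected": 5}}},
    }
}

for name, active, queue, rejected in busy_search_pools(SAMPLE):
    print(f"{name}: active={active} queue={queue} rejected={rejected}")
```

Sampling this every few seconds during a spike and comparing it with the same counters for the indexing pool would show whether the search queues grow on the same nodes, and at the same times, as the indexing load.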
Are there any interesting-looking log messages at around the times of the spikes? For instance, do you see any evidence that the nodes are performing more GC than normal?
I looked at an example queue spike from last week and didn't find anything too interesting:
There was one GC message, [2018-10-08T20:44:53,129][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-a-9.azurezp.com-node1] [gc][381638] overhead, spent [296ms] collecting in the last [1s], but that doesn't sound too bad.
Our ad-hoc logging of _nodes/hot_threads didn't show anything interesting at the time (it basically had no results).
We had significantly increased the number of bulk threads in use right before the search queue spike, so I'm thinking that could be the cause of this particular spike.
Thank you for your interest/help. If I have more concrete questions/information I'll make a new topic.