Thread pool size and the cluster settings API


I am a bit confused about the cluster settings API. The thread pool settings seem to be specific to a node.

For example, in the same cluster, some nodes have 1 CPU and some have 5 CPUs. When I run the _cluster/settings API, the values returned differ depending on which node I query:

curl -s '' | jq '.defaults.thread_pool.write'
{
  "queue_size": "200",
  "size": "5"
}
curl -s '' | jq '.defaults.thread_pool.write'
{
  "queue_size": "200",
  "size": "1"
}

(The nodes with 5 CPUs are the data nodes, by the way.)
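For anyone trying to reproduce this, the per-node view comes from adding include_defaults=true to the request; a sketch, with the node address as a placeholder for one of your own nodes:

```shell
# Placeholder address -- substitute one of your own nodes.
ES="http://localhost:9200"

# include_defaults=true is needed to see values derived from the
# responding node's hardware (like thread_pool.write.size);
# filter_path trims the response to the part we care about.
URL="$ES/_cluster/settings?include_defaults=true&filter_path=defaults.thread_pool.write"
echo "$URL"

# Against a live cluster (commented out here, needs a running node):
# curl -s "$URL" | jq '.defaults.thread_pool.write'
```

Running the same request against different nodes is what produces the differing "size" values above, since each node reports its own defaults.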

If I wanted to update this setting and increase the queue_size to, say, 300, or the size to 8, should I run the -X PUT version separately against each of the data nodes where I want it increased?

I am confused because I was under the impression that the _cluster/settings API is for working with cluster-wide settings, as the documentation says "Use this API to review and change cluster-wide settings." So I am not sure whether I am misunderstanding something here, given that the queue-size responses differ depending on which node answers the request.

Thank you

I would suggest reading "Should I increase my threadpool size if I get rejected executions or HTTP 429 responses?" before changing thread pool settings.
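Also, if you do decide to change them: as far as I understand, thread pool settings are static, node-level settings, so they are not updated through a PUT to _cluster/settings at all. They go in each node's elasticsearch.yml and take effect after a restart. A sketch, using the values you mentioned (which are assumptions for illustration, not recommendations):

```yaml
# elasticsearch.yml on each data node (static settings; node restart required)
thread_pool.write.size: 8
thread_pool.write.queue_size: 300
```

That also explains why the API shows per-node values: they are node defaults being reported, not a single cluster-wide setting.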

Yes, I get that; the queue is not full all the time, but occasionally it does reject a few tens of requests per thousand.
When those requests are rejected, the system that sends data to Elasticsearch takes a costly retry path, which quickly magnifies upstream and takes several seconds to resolve.

If the queue were just a bit larger, enough to absorb these 30-40 extra requests once in a while, or if there were more threads consuming from the pool, that would solve the problem, since the CPU utilization of the Elasticsearch nodes is well under its limits.

Hence my interest in the _cluster/settings API for the pool sizes.
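To quantify how often rejections actually happen per node, the rejected counter of the write pool can be watched with the cat API; a sketch, again with a placeholder address:

```shell
# Placeholder address -- substitute one of your own nodes.
ES="http://localhost:9200"

# Per-node view of the write pool: configured size/queue_size plus the
# live active, queued, and cumulative rejected counts.
URL="$ES/_cat/thread_pool/write?v&h=node_name,size,queue_size,active,queue,rejected"
echo "$URL"

# Against a live cluster (commented out here, needs a running node):
# curl -s "$URL"
```

Watching the rejected column at the times the upstream retries fire should confirm whether the queue is the bottleneck.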

Elasticsearch is more often limited by disk performance than by CPU, so it may be worth monitoring this at peak times. What is the specification of your cluster? What type of hardware are you using?

The load arrives at a fairly constant rate of homogeneous requests from machine-generated data, so there is no peak in the pattern. But yes, the disk might be misbehaving once in a while; it's an AWS EBS gp2 volume, and AWS's own page says that "AWS designs gp2 volumes to deliver their provisioned performance 99% of the time." So, for that remaining 1% of the time, the queue could very well fill up when operating near the volume's provisioned IOPS threshold.

While one solution would be a disk with higher IOPS, that is more expensive than increasing the queue size by a few tens of slots.
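As a rough sanity check on where that provisioned ceiling sits: gp2 gives a baseline of 3 IOPS per GiB, floored at 100 IOPS. For a hypothetical 500 GiB volume (the size is an assumption for illustration):

```shell
# gp2 baseline: 3 IOPS per GiB, with a minimum of 100 IOPS.
SIZE_GIB=500            # hypothetical volume size
IOPS=$((SIZE_GIB * 3))  # 500 GiB -> 1500 baseline IOPS
if [ "$IOPS" -lt 100 ]; then
  IOPS=100              # small volumes are floored at 100
fi
echo "$IOPS"
```

If sustained disk demand sits close to that number, the write queue filling up during the 1% window would be consistent with the volume briefly falling below its provisioned performance.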
