We operate some Elasticsearch clusters and have noticed that, in the most heavily queried cluster, the search_coordination thread pool periodically can't keep up, resulting in a queue that then translates to high latency for the end consumer.
For reference, the cluster in question has 3 master nodes, 6 coordinator nodes, and ~50 data nodes. Interestingly, given the pool's name I would have expected the issue to show up on the coordinator nodes, but no, the pool is being saturated on the data nodes. The data nodes are fairly large: over a dozen cores, a couple hundred GB of memory, and fast 1 TB SSDs attached.
In the documentation I see the following for search_coordination default sizing:

"For lightweight search-related coordination operations. Thread pool type is fixed with a size of a max of min(5, (# of allocated processors) / 2), and queue_size of 1000."
The wording is a bit confusing, but if I am interpreting it correctly, the default maximum number of threads is 5 (since our data nodes have well over 10 allocated processors, min(5, processors / 2) works out to 5). My first question: can we increase the number of threads dedicated to search_coordination? I believe that is a 'yes' via the elasticsearch.yml file. My second question: should we actually increase that static size?
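If we did go that route, my understanding (please correct me if I'm wrong) is that these are static per-node settings, so they would go in elasticsearch.yml on each data node and need a rolling restart. The values below are purely illustrative, not what we'd actually pick:

```yaml
# Static settings: set in elasticsearch.yml on each data node,
# applied only after a restart of that node.
thread_pool:
  search_coordination:
    size: 8           # default is min(5, allocated_processors / 2)
    queue_size: 2000  # default is 1000
```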
From what I have read, tuning the defaults is not recommended, but the threads not keeping up with demand is causing real consumer impact. Is increasing the dedicated thread pool size a good idea? Is there lower-hanging fruit we should try first, such as moving to more numerous but smaller data nodes?
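For context, this is roughly how we've been observing the saturation, via the _cat thread pool API (the endpoint below assumes a local node on port 9200; substitute your own host):

```
# Poll the search_coordination pool on every node; a persistently
# non-zero queue, or a growing rejected count, shows the pool
# falling behind.
curl -s 'http://localhost:9200/_cat/thread_pool/search_coordination?v&h=node_name,active,queue,rejected,completed'
```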