We operate some Elasticsearch clusters and have noticed that, in the most heavily queried cluster, the search_coordination thread pool periodically can't keep up, resulting in a queue that then translates to high latency for the end consumer.
For reference, the cluster in question has 3 master nodes, 6 coordinator nodes, and ~50 data nodes. Interestingly, given the pool's name I would have expected the issue to show up on the coordinator nodes, but no, the pool is being saturated on the data nodes. The data nodes are fairly large: over a dozen cores, a couple hundred GB of memory, and fast 1 TB SSDs attached.
In the documentation I see the following for search_coordination default sizing:

"For lightweight search-related coordination operations. Thread pool type is fixed with a size of a max of min(5, (# of allocated processors) / 2), and queue_size of 1000."
The wording is a bit confusing, but if I am interpreting it correctly, the default maximum number of threads is 5 (since our data nodes have well over 10 allocated processors, min(5, processors / 2) works out to 5). My first question: can we increase the number of threads dedicated to search_coordination? I believe that is a 'yes' via the elasticsearch.yml file. My second question: should we actually increase that static size?
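If we did go that route, my understanding (please correct me if I'm wrong) is that these are static per-node settings, so they would go in elasticsearch.yml on each data node and need a rolling restart. The values below are purely illustrative, not what we'd actually pick:

```yaml
# Static settings: set in elasticsearch.yml on each data node,
# applied only after a restart of that node.
thread_pool:
  search_coordination:
    size: 8           # default is min(5, allocated_processors / 2)
    queue_size: 2000  # default is 1000
```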
From what I have read, tuning the defaults is not recommended, but the threads not keeping up with demand is causing real consumer impact. Is increasing the dedicated thread pool size a good idea? Is there lower-hanging fruit we should try first, such as moving to more numerous but smaller data nodes?
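For context, this is roughly how we've been observing the saturation, via the _cat thread pool API (the endpoint below assumes a local node on port 9200; substitute your own host):

```
# Poll the search_coordination pool on every node; a persistently
# non-zero queue, or a growing rejected count, shows the pool
# falling behind.
curl -s 'http://localhost:9200/_cat/thread_pool/search_coordination?v&h=node_name,active,queue,rejected,completed'
```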