We have a very large Elasticsearch cluster with the following node layout:
masters: 5
ingest-only: 4
data-only: 98
coordinating-only: 3
Coordinating-only nodes have 32 cores and 64 GB of memory; data nodes typically have 32 cores and 256 GB. The cluster is spread across 3 zones in a single cloud region.
The cluster handles a mix of workloads, including search, aggregations, scrolls, percolation, and continuous bulk ingestion.
Our response times have become very slow, which we initially attributed to the ever-increasing amount of data we have to search through. However, when we dug into it, we were astonished to find that for the same sample search query, response times were 10x to 100x lower when querying a data node directly than when querying a coordinating-only node directly.
Here is a sample of "took" times from alternately sending the same query to different coordinating and data nodes (a simplified sketch of the test loop follows the numbers). This is our live, production ES cluster, so the coordinating nodes are receiving the full brunt of the customer-facing traffic. The times listed are taken from the "took" property of the response.
run  node             "took"
1    coordinating-0   10.6 s
2    data-0           0.078 s
3    coordinating-1   0.031 s   # <--- significant outlier
4    data-1           0.140 s
5    coordinating-2   12.7 s
6    data-2           0.106 s
7    coordinating-1   17.8 s
8    data-3           0.102 s
9    coordinating-0   17.2 s
Average coordinating node "took" time: 11.67 s
Average data node "took" time: 0.1065 s
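For context, each measurement boils down to something like the Python sketch below: send the identical query body straight to one node's HTTP port, so that node acts as the coordinating node for the request, and record the "took" value from the response. The hostnames, index name, and query body here are illustrative placeholders, not our real ones.

```python
# Simplified sketch of the comparison loop.
# Hostnames, index name, and query body are illustrative placeholders.
import requests

NODES = [
    "http://coordinating-0:9200",
    "http://data-0:9200",
    "http://coordinating-1:9200",
    "http://data-1:9200",
    # ... alternating coordinating-only and data nodes
]

QUERY = {
    "query": {"match": {"message": "example search terms"}},  # placeholder query
    "size": 10,
}

for node in NODES:
    # Hitting a node's HTTP endpoint directly makes that node the coordinating
    # node for this request, regardless of its role in the cluster.
    resp = requests.post(f"{node}/my-index/_search", json=QUERY, timeout=120)
    resp.raise_for_status()
    took_ms = resp.json()["took"]  # server-side time reported by Elasticsearch
    print(f"{node}: {took_ms / 1000:.3f} s")
```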
We observe the same pattern across a much larger sample size, so we are confident we are not simply hitting caches when querying data nodes and missing them when querying coordinating-only nodes.
At the time the above test was run, we had only 3 coordinating-only nodes. Increasing to 6 and then eventually 12 cut response times by a factor of about 3 and then about 2, respectively (roughly 11.7 s, then ~4 s, then ~2 s). So adding coordinating-only nodes helps, but we are getting diminishing returns, and even at 12 nodes we are nowhere near the ~0.1 s we see against data nodes.
What could be the bottleneck? Even before we increased the node count, the coordinating-only nodes did not look saturated:
CPU utilization: ~20%
ES active thread queue: generally around 1
system load: ~5
young GC: ~100 ms, old GC: ~0
network traffic: ~70 MiB/s
They also have far less filesystem cache than the data nodes (only ~7 GB vs ~200 GB), but I wouldn't expect coordinating-only nodes to lean much on the FS cache.
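In case it is useful, the ES-side numbers above (search thread pool queue, GC times) are the kind of thing that can be read from the node stats API; a rough sketch, with a placeholder node URL, is:

```python
# Rough sketch of reading the ES-side metrics (search thread pool queue,
# GC times) from the node stats API. The base URL is a placeholder.
import requests

BASE = "http://coordinating-0:9200"

# Node stats for the node we are talking to, limited to the jvm and
# thread_pool sections.
stats = requests.get(f"{BASE}/_nodes/_local/stats/jvm,thread_pool", timeout=30).json()

for node_id, node in stats["nodes"].items():
    search_pool = node["thread_pool"]["search"]
    gc = node["jvm"]["gc"]["collectors"]
    print(node["name"])
    print("  search queue:     ", search_pool["queue"])
    print("  search rejections:", search_pool["rejected"])
    print("  young GC ms:      ", gc["young"]["collection_time_in_millis"])
    print("  old GC ms:        ", gc["old"]["collection_time_in_millis"])
```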