Coordinating-only nodes are a _significant_ bottleneck for our cluster

We have a very large Elasticsearch cluster with the following node layout:

masters: 5
ingest-only: 4
data-only: 98
coordinating-only: 3

Coordinating-only nodes have 32 cores and 64 GB of memory, data nodes typically have 32 cores and 256 GB of memory. The cluster is spread across 3 zones in a single cloud region.

The cluster handles a variety of workloads, including search, aggregations, scrolling, percolation, and continuous bulk ingestion.

Our response times have become very slow, which we initially attributed to the ever-increasing amount of data we have to search through. However, when digging into it, we were astonished to find that for the same sample search query, response times were 10x-100x lower when querying a data node directly than when querying a coordinating-only node directly.

Here is a sample of "took" times from alternately sending the same query to different coordinating and data nodes. This is our live, production ES cluster, so the coordinating nodes are receiving the full brunt of the customer-facing traffic. The times listed are taken from the "took" property of the response.

1     coordinating-0     10.6   s
2     data-0              0.078 s
3     coordinating-1      0.031 s      # <--- significant outlier
4     data-1              0.140 s
5     coordinating-2     12.7   s
6     data-2              0.106 s
7     coordinating-1     17.8   s
8     data-3              0.102 s
9     coordinating-0     17.2   s

Average coordinating node "took" time: 11.67 s
Average data node "took" time: 0.1065 s
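As a quick sanity check, the quoted averages can be recomputed from the samples above (the first coordinating sample is taken as 10.6 s, which is consistent with the 11.67 s average):

```python
# "took" samples in seconds, from the table above.
# First coordinating sample assumed ~10.6 s (consistent with the quoted 11.67 s average).
coordinating = [10.6, 0.031, 12.7, 17.8, 17.2]  # includes the 0.031 s outlier
data = [0.078, 0.140, 0.106, 0.102]

coord_avg = sum(coordinating) / len(coordinating)
data_avg = sum(data) / len(data)

print(f"coordinating avg: {coord_avg:.2f} s")   # 11.67 s
print(f"data avg:         {data_avg:.4f} s")    # 0.1065 s
print(f"ratio:            {coord_avg / data_avg:.0f}x")
```

Even with the 0.031 s outlier included, the ratio works out to roughly 100x, matching the upper end of the range above.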

We observe this result pattern across a much larger sample size and are confident that we are not just happening to hit caches when querying against data nodes and missing caches when querying against coordinating-only nodes.

At the time the above test was run, we had only 3 coordinating-only nodes. Increasing to 6 and then eventually 12 decreased response times by a factor of about 3, then about 2, respectively. So adding coordinating-only nodes helps, but we are getting diminishing returns.


What could be the bottleneck? Even before increasing the node count, coordinating-only nodes were sitting at about 20% CPU utilization, their ES active thread queue was generally around 1, system load was about 5, young garbage collection pauses were around 100 ms, old GC was essentially zero, and network traffic was about 70 MiB/s. They have a pretty low filesystem cache compared to data nodes (only 7 GB vs ~200 GB), but I wouldn't expect coordinating-only nodes to make much use of FS cache anyway.
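For anyone wanting to pull the same metrics on their own cluster, these are the standard 7.x monitoring endpoints we'd look at (a sketch; in particular, the `rejected` column on the search pool is worth watching):

```
GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected

GET _nodes/stats/jvm,transport?filter_path=nodes.*.name,nodes.*.jvm.gc,nodes.*.transport
```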

Which version of Elasticsearch are you using? What is the use case? What does the workload look like in terms of index/query/update ratios and throughput? What type of instances are you using?

We are using Elasticsearch 7.9.3.

We have terabytes of full-text data, typically "conversation" style data as found on web-based communities like this one. The primary use case is real-time search over these data, as well as percolating alerts in real time as data is ingested. The cluster powers a search portal where users can run queries and view results, as well as configure alerts for percolation. A secondary use case is to allow users to run searches directly instead of going through the UI. Some users use this to bulk download data that matches a particular query (leveraging Elasticsearch's scroll feature).
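For anyone unfamiliar with the percolation side, in 7.x it works roughly like this (a minimal sketch with hypothetical index and field names, not our actual schema):

```
PUT alerts
{
  "mappings": {
    "properties": {
      "query":   { "type": "percolator" },
      "message": { "type": "text" }
    }
  }
}

PUT alerts/_doc/1
{
  "query": { "match": { "message": "elasticsearch" } }
}

GET alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": { "message": "new post mentioning elasticsearch" }
    }
  }
}
```

Each stored query is evaluated against incoming documents, which is how the alerting works: every ingested document is effectively matched against all registered alert queries.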

I don't have the index/query/update ratios, nor throughputs available, but will work on getting those.

I touched on the instance types briefly:

Coordinating-only nodes have 32 cores and 64 GB of memory, data nodes typically have 32 cores and 256 GB of memory.

Is there more detail about the instances that would be helpful?

Elasticsearch 7.9 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

It would be useful to know the instance type as limits around networking could be an issue.

This is very old so I would recommend you upgrade, at least to the latest 7.17 release. Do you have any third party plugins installed that could affect performance?

If coordinating only nodes are a bottleneck, have you tried sending all traffic directly to the data nodes? What was the reason you added coordinating only nodes in the first place?

Do you have any third party plugins installed that could affect performance?

We have searchguard installed, though it's unclear whether that impacts performance. The other plugins we have are internal, I think: analysis-icu, analysis-smartcn, and discovery and recovery plugins for EC2/S3 and GCE/GCS (though our cluster is fully hosted on only one cloud).
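(For reference, the full per-node plugin list can be pulled with the cat API:)

```
GET _cat/plugins?v&h=name,component,version
```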

I have not used SearchGuard myself, so I don't know what its performance characteristics are. One of the tasks coordinating-only nodes do handle is security, so even though it may be unlikely, I would not rule out that it has an impact.

If coordinating only nodes are a bottleneck, have you tried sending all traffic directly to the data nodes?

What was the reason you added coordinating only nodes in the first place? I have seen a lot of large clusters work perfectly fine without the use of coordinating only nodes.

If coordinating only nodes are a bottleneck, have you tried sending all traffic directly to the data nodes? What was the reason you added coordinating only nodes in the first place?

We are considering trying sending all traffic to data nodes as a next step.

We added coordinating nodes as is generally recommended for clusters of this size. We've found it helps isolate failures, and is especially helpful for failures occurring in the connection and coordination stages. It's a lot lighter-weight to restart those nodes vs nodes laden with data, where a restart could potentially trigger shard reallocation.

Interesting. I don't know if we can test running any queries without searchguard at all, but this is definitely something to look into.

Following up here with an update for those interested.

We experimented with putting searchguard on the coordinating-only nodes (making them combined data and coordinating nodes, but holding just the searchguard index) and did not notice a difference in performance.
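For anyone wanting to try the same thing: pinning a single index to a specific set of nodes can be done with index-level shard allocation filtering. A sketch, assuming the index is named searchguard and using hypothetical node names:

```
PUT searchguard/_settings
{
  "index.routing.allocation.include._name": "coordinating-0,coordinating-1,coordinating-2"
}
```

With `include`, a shard may be allocated to any node whose name matches one of the listed values.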

We then experimented with sending only search and aggregation requests from our UI to a set of 4 dedicated coordinating nodes and the remaining traffic (search requests from our API, scrolling, and percolation) to the remaining 8 coordinating nodes. The result is that the UI requests to the 4 dedicated coordinating nodes perform on par with our sampling of requests going directly to data nodes (huzzah!).
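The split itself happened at our load-balancer layer. A sketch of the idea in nginx syntax, with hypothetical hostnames and a hypothetical routing header (our actual setup differs):

```nginx
# Route UI search/agg traffic to dedicated coordinating nodes,
# everything else (API search, scroll, percolation) to the rest.
map $http_x_traffic_class $es_backend {
    "ui"    es_ui;
    default es_everything_else;
}

upstream es_ui {
    server coordinating-ui-0:9200;
    server coordinating-ui-1:9200;
    server coordinating-ui-2:9200;
    server coordinating-ui-3:9200;
}

upstream es_everything_else {
    server coordinating-api-0:9200;
    # ... remaining coordinating nodes
}

server {
    listen 9200;
    location / {
        proxy_http_version 1.1;
        proxy_pass http://$es_backend;
    }
}
```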

Given that this alleviates the major pain point for our product and we're less concerned with the performance of other use-cases, we may or may not pursue this further to isolate which of the other use cases might be causing the issue. Or maybe it's not a matter of use case but traffic volume, since these UI-only coordinating nodes are handling significantly fewer requests than the 8 "everything-else" coordinating nodes.