Uneven load on data nodes in different AWS Availability Zones

We have data nodes in 3 different AWS AZs (availability zones) and 3 separate coordinator nodes.
We noticed that the data nodes in one of the AZs experience less load than the nodes in the other 2 AZs.
It turned out that one of the coordinator nodes was in the wrong AZ: it was in an availability zone in which the cluster has no data nodes at all.

In other words, we had data nodes in zones A, B and C, and 3 coordinator nodes in zones A, B and D, but none in C. This resulted in the data nodes in zone "C" receiving less load: lower CPU usage, etc.

When we replaced the coordinator node in AZ D with a node in AZ C, the load became balanced.

Our coordinator nodes are behind an ELB, and the ELB metrics show that all 3 coordinator nodes were receiving the same number of requests.

We have the AWS zone awareness plugin enabled, which makes sure that the primary and the replica of any shard are never in the same AZ.
We have 40 nodes in the cluster and every index has exactly 40 shards (20 primaries and 20 replicas), hence our shard distribution across the nodes is perfectly even.
The Elasticsearch version is 6.8.

Is there anything that would make a coordinator node "prefer" data nodes in the same zone and avoid routing requests to data nodes in different AZs?

NOTE: We don't have the adaptive replica selection feature enabled.

Yes, allocation awareness does that:

Elasticsearch prefers using shards in the same location (with the same awareness attribute values) to process search or GET requests. Using local shards is usually faster than crossing rack or zone boundaries.

That's true of all versions released so far, but it won't be the case in 8.0.0, and in 7.5.0 there is also a system property to disable this behaviour.
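To see why this preference would starve zone C in the setup described above, here is a rough back-of-the-envelope model in Python. It is only a sketch, not Elasticsearch's actual routing code, and it rests on several simplifying assumptions: every shard has exactly two copies in two different zones, all zone pairs are equally common, the load balancer spreads requests evenly across coordinators, and a coordinator with no local copy picks uniformly among the available copies. The function name `zone_load_share` is made up for this illustration.

```python
from fractions import Fraction
from itertools import combinations


def zone_load_share(data_zones, coordinator_zones):
    """Expected share of shard-level requests served by each data zone.

    Simplified model (assumption): each shard has two copies in two
    different zones, and every pair of zones is equally likely.
    """
    # All possible (zone, zone) placements of a shard's two copies.
    shard_layouts = list(combinations(data_zones, 2))
    load = {zone: Fraction(0) for zone in data_zones}
    # Each (coordinator, shard layout) combination carries equal weight.
    weight = Fraction(1, len(shard_layouts) * len(coordinator_zones))

    for coord_zone in coordinator_zones:
        for copies in shard_layouts:
            # Prefer a copy in the coordinator's own zone, if one exists.
            local = [zone for zone in copies if zone == coord_zone]
            # Otherwise fall back to spreading evenly over all copies.
            candidates = local or list(copies)
            for zone in candidates:
                load[zone] += weight / len(candidates)
    return load


# Coordinators in A, B, D (the original, unbalanced setup).
skewed = zone_load_share(["A", "B", "C"], ["A", "B", "D"])
# Coordinators in A, B, C (after replacing the node in zone D).
balanced = zone_load_share(["A", "B", "C"], ["A", "B", "C"])

for name, result in (("A/B/D", skewed), ("A/B/C", balanced)):
    shares = ", ".join(f"{zone}={share}" for zone, share in result.items())
    print(f"coordinators in {name}: {shares}")
```

Under these assumptions, zones A and B each serve 7/18 of the shard-level requests and zone C only 2/9 while a coordinator sits in zone D; with coordinators in A, B and C every zone serves exactly 1/3. The real numbers in a cluster will differ, but the direction of the skew matches what was observed.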
