Elasticsearch high CPU usage due to one index, and the replica doesn't help

Hi all,

We have an ES cluster spanning different cloud regions, say region1 and region2. We have the configuration below in elasticsearch.yml (this one is for region1 machines), so an index with 1p:1r will have either a primary or a replica in each region.

cluster.routing.allocation.awareness.attributes: region
node.attr.region: region1
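
For the region2 machines the counterpart would presumably differ only in the attribute value (an assumption, since only the region1 file is shown above):

```yaml
# Assumed elasticsearch.yml fragment for nodes in region2
cluster.routing.allocation.awareness.attributes: region
node.attr.region: region2
```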

We route our traffic to one region through DNS+load balancer and we started seeing an issue with one of the indices recently.

Let's say we are currently routing traffic through DNS to region1, and an index, say index1, has its primary on a node in region1 and its replica on a node in region2.

Due to excessive querying for data in index1, the CPU on the node that hosts the primary of the index in region1 gets exhausted, but we also notice that the node in region2 which has the replica doesn't have much activity.

Shouldn't Elasticsearch be routing/dividing traffic and getting the node with the replica to help as well in such a situation?

It is generally not recommended to deploy Elasticsearch across regions unless they are quite close and offer very low latency between them. With just 2 regions it is also impossible to make the cluster HA, because losing the region that holds the majority of master-eligible nodes leaves the cluster without a quorum, so a third region may be required.

Which version of Elasticsearch are you using? If you are on a reasonably new version, Elasticsearch by default uses adaptive replica selection when executing a query. If you have long latencies between the regions, queries executed against the remote shard will be slower and the local shard is likely to be favoured. You may want to experiment with disabling this, but that will send approximately half of the queries to the remote shard, which could increase latencies for all/most queries.
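
As a rough sketch of what disabling it could look like, assuming you put it in elasticsearch.yml next to your awareness settings:

```yaml
# Turn off adaptive replica selection (it defaults to true on 7.x).
# With it off, searches are distributed round-robin across all shard
# copies, primaries and replicas alike, regardless of region.
cluster.routing.use_adaptive_replica_selection: false
```

The setting is also dynamic, so it can be changed through the cluster update settings API without restarting nodes, which makes it easier to revert if latencies get worse.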

We have 7.17.x. Yeah, increased latency is a tradeoff, but given the situation we have with adaptive replica selection enabled, disabling it could leave us in a relatively better state overall. We will try this out. Thank you.

Yes. But we had to stick to this setup due to budget-related concerns. We are trying to have a voting node in a third location to address this. Thank you again for all the suggestions.
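
A minimal sketch of what such a node's elasticsearch.yml might contain, assuming it is set up as a dedicated voting-only master-eligible node (an assumption; nothing above says how the node will actually be configured):

```yaml
# Assumed config for a tiebreaker node in the third location.
# voting_only has to be combined with the master role: the node takes part
# in master elections but is never elected master itself, and with no data
# role it holds no shards.
node.roles: [ master, voting_only ]
```

Because it holds no shards, a node like this carries no search load and would not by itself change how queries for index1 are distributed.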