Processing concentration on some cluster nodes

Hi, we currently have an Elasticsearch cluster consisting of 18 data nodes and 3 master nodes (m4.4xlarge instances, 16 cores and 64 GB RAM each, distributed across 2 zones in AWS). We use Elasticsearch version 5.6.2 with Java JRE 1.8.0_162 64-bit (with -Xms30g -Xmx30g). The application connects to the cluster via the Java transport client (through the data nodes). The heavy searches basically access 2 indexes. Each index has 12 shards and 2 replicas. One index has 37 million documents and 78 GB of data, and the other has 49 million documents and 230 GB of data.
In times of high load, we notice that 3 to 4 nodes run at close to 100% CPU while the other nodes in the cluster are at 65 to 70%. The total latency of the cluster goes up. The high processing load on a few nodes limits the total throughput of the cluster. Is there any reason for processing to be concentrated on some nodes in the cluster? How could we better distribute the processing across the cluster?
We have already tried connecting to the cluster through a coordinating node, but the CPU concentration occurs in the same way.

Are shards distributed evenly across the cluster? Are you using any features that can cause uneven load across an index, e.g. routing or parent-child? Are you performing a lot of scripted updates?

Hi Christian,

Are shards distributed evenly across the cluster?
Yes
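
A quick way to verify this is to count shard copies per node from the output of GET _cat/shards?format=json. The sketch below is only an illustration: the base URL and the index names index_a and index_b are placeholders, not values from this cluster. With 12 shards and 2 replicas, each index has 12 x 3 = 36 shard copies, so an even layout on 18 data nodes would be 2 copies per index per node.

```python
import json
from collections import Counter, defaultdict
from urllib.request import urlopen


def fetch_cat_shards(base_url="http://localhost:9200"):
    """Fetch _cat/shards as a list of JSON rows (base_url is a placeholder)."""
    with urlopen(base_url + "/_cat/shards?format=json") as resp:
        return json.loads(resp.read().decode("utf-8"))


def shard_copies_per_node(rows, indices):
    """Count STARTED primaries ('p') and replicas ('r') per node for the given indexes."""
    counts = defaultdict(Counter)
    for row in rows:
        if row.get("index") in indices and row.get("state") == "STARTED":
            counts[row["node"]][row["prirep"]] += 1
    return {node: dict(counter) for node, counter in counts.items()}


# Example usage against a live cluster (index names are hypothetical):
# rows = fetch_cat_shards()
# for node, c in sorted(shard_copies_per_node(rows, {"index_a", "index_b"}).items()):
#     print(node, "primaries:", c.get("p", 0), "replicas:", c.get("r", 0))
```

Since the two indexes differ a lot in size (78 GB vs 230 GB), it may be worth running the count once per index as well: a node holding more copies of the larger index could be doing more work even if the total number of shards per node is the same.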

Are you using any features that can cause uneven load across an index, e.g. routing or parent-child?
No

Are you performing a lot of scripted updates?
In our tests there is no indexing. We do not use scripts in the queries, only aggregations and function score.

If you run the hot threads API on one of the busy nodes and compare it to one of the less busy ones, do you see any differences? Is it always the same nodes that are more busy? Do you see any differences in I/O performance between the nodes? Are requests evenly distributed across all nodes in the cluster?

If you run the hot threads API on one of the busy nodes and compare it to one of the less busy ones, do you see any differences?
I looked, but I did not notice any relevant difference.
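
To make that comparison less of an eyeball exercise, the hot threads reports of a busy and a less busy node can be fetched and reduced to their header lines. This is a rough sketch only: the base URL and node names are placeholders, and the regular expression assumes header lines of the form "90.5% (452.5ms out of 500ms) cpu usage by thread '...'", which may not match every version's output exactly.

```python
import re
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed header format of one hot thread entry; adjust if your output differs.
HEADER = re.compile(r"^\s*([\d.]+)% \(.*?\) (\w+) usage by thread '(.+)'\s*$")


def hot_threads_url(base_url, node, threads=5, interval="500ms", kind="cpu"):
    """Build the hot threads URL for one node (node name, id, or address)."""
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return "{}/_nodes/{}/hot_threads?{}".format(
        base_url.rstrip("/"), quote(node, safe=""), query
    )


def fetch_hot_threads(base_url, node, **kwargs):
    """Return the plain-text hot threads report for one node."""
    with urlopen(hot_threads_url(base_url, node, **kwargs)) as resp:
        return resp.read().decode("utf-8")


def busiest_threads(report):
    """Extract (percent, thread_name) pairs from a report, highest usage first."""
    found = []
    for line in report.splitlines():
        match = HEADER.match(line)
        if match:
            found.append((float(match.group(1)), match.group(3)))
    return sorted(found, reverse=True)


# Example usage (node names are hypothetical):
# busy = busiest_threads(fetch_hot_threads("http://localhost:9200", "data-node-1"))
# calm = busiest_threads(fetch_hot_threads("http://localhost:9200", "data-node-2"))
```

Sampling several times during a load peak is probably more informative than a single report, since each call only covers one short interval.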

Is it always the same nodes that are more busy?
In general they are the same machines, but it is not a rule

Do you see any differences in I/O performance between the nodes?
All nodes have the same hardware and configuration

Are requests evenly distributed across all nodes in the cluster?
Yes, but we do not know whether this concentration is caused by the Java transport client (we only list the IPs of the data nodes) or arises internally within Elasticsearch. At times we have the feeling that Elasticsearch prioritizes machines that only have replicas, but that is not a rule either. What we notice is that processing is concentrated on some nodes. How does Elasticsearch decide which machines will be used for each request? Is there any prioritization depending on response time or on the last machine that responded for a given shard?
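
One way to separate the two possibilities is to look at the shard-level search counters each node reports in GET _nodes/stats/indices/search. These count the query phases a node actually executed on its own shards, regardless of which node the client connected to. The sketch below is an illustration under assumptions: the base URL is a placeholder and the helper names are made up for this example. The counters are cumulative since node start, so two snapshots taken around a load test are compared.

```python
import json
from urllib.request import urlopen


def fetch_search_stats(base_url="http://localhost:9200"):
    """Fetch per-node search stats (base_url is a placeholder)."""
    with urlopen(base_url + "/_nodes/stats/indices/search") as resp:
        return json.loads(resp.read().decode("utf-8"))


def query_load_per_node(stats):
    """Return {node_name: (query_total, avg_query_ms)} from a node stats response."""
    result = {}
    for node in stats.get("nodes", {}).values():
        search = node["indices"]["search"]
        total = search["query_total"]
        avg_ms = search["query_time_in_millis"] / total if total else 0.0
        result[node["name"]] = (total, avg_ms)
    return result


def queries_between(before, after):
    """Shard-level queries executed per node between two snapshots."""
    return {
        name: after[name][0] - before[name][0]
        for name in after
        if name in before
    }


# Example usage around a load test:
# before = query_load_per_node(fetch_search_stats())
# ... run the load test ...
# after = query_load_per_node(fetch_search_stats())
# print(sorted(queries_between(before, after).items(), key=lambda kv: -kv[1]))
```

If the busy nodes executed roughly the same number of shard queries as the others but with a higher average time, the imbalance would be more likely in the cost per query (for example, which index's shards they hold) than in how requests are routed.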

Does anyone have any ideas about the problem?
