Processing concentration on some cluster nodes - The return

Hi,
I previously opened the topic "Processing concentration on some cluster nodes" about processing being concentrated on a few nodes of our Elasticsearch cluster. We have since run several tests and noticed the following behavior: with only 1 replica, the concentration on a few nodes does not occur. It only occurs with 2 replicas. As mentioned in the previous topic, we have 2 main indexes with 12 shards each, and 18 data nodes. We also tested accessing the cluster directly over REST, because we suspected a problem with the transport client, but the concentration on a few nodes occurs over REST as well. Could there be a bug with the use of 2 replicas? (Most people probably use only 1 replica.)

The situation described in your earlier thread sounds like a good candidate for adaptive replica selection (see also the blog post about how it works). This was added in 6.1.0 but you were on 5.6.2 when you last asked. Can you upgrade?

Hi David,

We are currently unable to upgrade Elasticsearch to version 6.x because our application relies heavily on the Transport Client (Java API) and we use types. With only 1 replica the whole cluster is balanced. Why does the cluster become unbalanced when we use 2 replicas?

If you go from 1 to 2 replicas you are increasing the amount of data stored on disk by 50%. Could it be that this leads to increased disk I/O, which results in higher load?
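To make the arithmetic concrete, here is a small Python sketch using the figures from this thread (2 indices, 12 primary shards each, 18 data nodes). It assumes both indices use the same replica count and that all shard copies are of similar size, neither of which is confirmed here; the function name is only for illustration.

```python
def shard_copies(indices, primaries_per_index, replicas):
    """Total shard copies: each primary plus its replicas."""
    return indices * primaries_per_index * (1 + replicas)


DATA_NODES = 18  # from the original post

one_replica = shard_copies(2, 12, 1)
two_replicas = shard_copies(2, 12, 2)

# Relative growth in stored data when going from 1 to 2 replicas,
# assuming equally sized shard copies.
growth = (two_replicas - one_replica) / one_replica

print(f"1 replica:  {one_replica} copies, {one_replica / DATA_NODES:.2f} per node")
print(f"2 replicas: {two_replicas} copies, {two_replicas / DATA_NODES:.2f} per node")
print(f"growth: {growth:.0%}")
```

Under these assumptions the cluster goes from 48 to 72 shard copies, which is where the 50% figure comes from.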

The disks are SSDs, so I think the influence would be small, especially on CPU consumption. Our machines have 64 GB of RAM, with the JVM using 30 GB (leaving nearly 32 GB for mmap).

This is certainly a puzzle. I wonder if perhaps searching one of your indices is more expensive than the other one, and the allocation of the shards is such that the expensive shards are more concentrated on the few problematic nodes. Elasticsearch balances the cluster based on shard count, considering all shards as basically equal, so this is possible.
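One way to check this would be to count the started shard copies per node and per index. Below is a rough Python sketch, assuming you have already fetched the output of GET /_cat/shards?format=json and parsed it into a list of dicts. The rows shown are made-up sample data, and the index and node names are hypothetical.

```python
from collections import Counter


def shards_per_node(rows):
    """Count started shard copies for each (node, index) pair.

    `rows` is assumed to be the parsed JSON of GET /_cat/shards?format=json,
    i.e. a list of dicts with at least "index", "state" and "node" keys.
    """
    counts = Counter()
    for row in rows:
        # Unassigned or relocating copies are skipped; they have no stable node.
        if row.get("state") == "STARTED" and row.get("node"):
            counts[(row["node"], row["index"])] += 1
    return counts


# Made-up sample rows, for illustration only.
sample_rows = [
    {"index": "index_a", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "index_a", "shard": "1", "prirep": "r", "state": "STARTED", "node": "node-1"},
    {"index": "index_b", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "index_b", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]

counts = shards_per_node(sample_rows)
for (node, index), n in sorted(counts.items()):
    print(f"{node}  {index}  {n}")
```

If the busy nodes turn out to hold noticeably more copies of one index than the quiet nodes do, that would support the uneven-allocation theory.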

I also wonder whether the busier nodes are busier simply because they see more searches (and related activity) because of some kind of routing oddity. You can find this out by looking at the nodes statistics:

GET /_nodes/stats?filter_path=nodes.*.indices.search

This shows cumulative statistics about each node, so you need to look at two consecutive outputs some time apart and take deltas. Is there any significant difference here between the busier nodes and the quieter ones?
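Taking the deltas by hand is tedious with 18 nodes, so here is a minimal Python sketch of the comparison. It assumes you have saved two responses from the request above as parsed JSON and that each node entry contains the usual `query_total` and `query_time_in_millis` counters under `indices.search`. The snapshots below are made-up numbers, and the node IDs are hypothetical.

```python
def search_deltas(before, after):
    """Return the per-node increase in search counters between two snapshots.

    Both arguments are assumed to be parsed responses of
    GET /_nodes/stats?filter_path=nodes.*.indices.search
    """
    deltas = {}
    for node_id, node in after["nodes"].items():
        if node_id not in before["nodes"]:
            continue  # node joined between the two snapshots
        new = node["indices"]["search"]
        old = before["nodes"][node_id]["indices"]["search"]
        deltas[node_id] = {
            "queries": new["query_total"] - old["query_total"],
            "query_ms": new["query_time_in_millis"] - old["query_time_in_millis"],
        }
    return deltas


# Made-up snapshots, for illustration only.
snapshot_1 = {"nodes": {
    "nodeA": {"indices": {"search": {"query_total": 1000, "query_time_in_millis": 50000}}},
    "nodeB": {"indices": {"search": {"query_total": 1000, "query_time_in_millis": 20000}}},
}}
snapshot_2 = {"nodes": {
    "nodeA": {"indices": {"search": {"query_total": 1600, "query_time_in_millis": 95000}}},
    "nodeB": {"indices": {"search": {"query_total": 1200, "query_time_in_millis": 26000}}},
}}

deltas = search_deltas(snapshot_1, snapshot_2)
for node_id, d in sorted(deltas.items()):
    print(f"{node_id}: {d['queries']} queries, {d['query_ms']} ms")
```

A large gap in the query count would point towards routing, while a similar query count with a much higher query time would point towards the shards on that node being more expensive to search.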

One effect of increasing the data on each node is that it's easier to run out of filesystem cache, which certainly can affect performance. The balanced performance with 1 replica might be a distraction - it's puzzling that there are these hotspots with any number of replicas.
