Can the number of shards per node be the bottleneck in a cluster?

Hi Folks,

I have the following cluster.
Nodes: 6 (48 vCPUs and 384 GB memory)
Shards: 158
EBS volume: 24 TB gp3 (provisioned IOPS: 50,000 and 1,781 Mb/sec throughput per node)
Replicas: 0
~11B documents of ~10 KB each.

When I benchmark the cluster, varying clients from 1 to 150 and target throughput from 1 to 200 ops/sec, I see CPU utilization stay under 25%. EBS read IOPS and throughput are also well under the provisioned limits. I am not able to conclude whether the search requests are CPU- or IO-bound, but latency increases as I increase clients and throughput (see the table below).

| num_clients | target-throughput (ops/sec) | Latency p99 (p99.9), ms | Service time p99 (p99.9), ms |
|---|---|---|---|
| 1 | 1 | 86.38 | 82.2195 |
| 2 | 10 | 69.89 | 68.14 |
| 20 | 50 | 87.046 (105.091) | 85.286 (103.194) |
| 35 | 100 | 119.2 (129.823) | 117.295 (127.525) |
| 75 | 100 | 231.84 (249.435) | 228.489 (247.742) |
| 75 | 200 | 235.66 (263.879) | 231.384 (260.173) |
| 150 | 200 | 437.97 (476.326) | 427.227 (469.457) |
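One way to check whether the requests are CPU- or IO-bound is a hot threads snapshot taken on each node during the run (just a sketch; the threads and interval parameters exist in both Elasticsearch and OpenSearch):

    GET _nodes/hot_threads?threads=5&interval=500ms

If the busiest threads in the search pool are spending their time scoring, the load is CPU-bound; if they are mostly blocked on disk reads, it is IO-bound.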

During the benchmark, I see that on each node up to 5 shards are searched concurrently.

Is there a limit on the number of search threads in a node that can search one particular shard at the same time?

The reason I am asking is that the number of vCPUs per node (48) is almost 1.8 times the number of shards per node (~27). If only one search thread can search a given shard at a time, then at any given moment only 27 of the 73 search threads on a node can be active, leading to low CPU utilization.
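For reference, the number of active search threads per node can be watched during the run with the cat thread pool API (a sketch; these column names exist in both Elasticsearch and OpenSearch):

    GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed

If active hovers around the per-node shard count rather than the pool size, that would support the reasoning above.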

Is the above a fair reasoning for why I see low CPU utilization? I am about to increase replicas to 1 to see if that helps, but I wanted to understand why this could happen.

Also, I am not able to see why doubling the number of clients generating the same load doubles the service time. From the service's point of view, the load is the same irrespective of how many clients generate it, so it should be handled the same, right?

Please share your thoughts.
Thanks.

Which version of Elasticsearch are you using?

What is the average index and shard size? Are you currently running completely without replicas?

What type of queries are you running? Do you have any concurrent indexing running? Are all queries targeting all indices?

I'm actually using OpenSearch 2.5.

Shard size after indexing: 216.5 GB. Yes, number_of_replicas is 0.
I run script_score-based kNN queries. No, the search benchmark is run after indexing completes and all segments within each shard are merged down to 1. Queries target only one index.
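For context, the queries are roughly of this shape (a sketch of an OpenSearch exact k-NN scoring-script search; the index name, vector field, and query vector are placeholders):

    GET my-knn-index/_search
    {
      "size": 10,
      "query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
              "field": "my_vector",
              "query_value": [1.5, 2.5, 3.5],
              "space_type": "l2"
            }
          }
        }
      }
    }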

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

I would recommend you raise this with the OpenSearch community, as their kNN implementation is different from Elasticsearch's.

OpenSearch is not supported here; it has some changes in the code made mostly by AWS.

But considering that OpenSearch uses a fork of Elasticsearch 7.10, you need to follow the recommendations for the number of shards and heap memory for that version, which basically are:

  • Your Java heap should not be higher than 32 GB; you need to stay below the compressed oops threshold, which on most systems means a maximum heap of around 30 GB (see the sketch after this list for one way to verify the configured heap).
  • Have a maximum of 20 shards per GB of heap, so with 30 GB of heap you should have a maximum of 600 shards on the node.
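One way to verify the configured heap on each node is the cat nodes API (a sketch; the heap.max and heap.percent columns exist in both Elasticsearch and OpenSearch):

    GET _cat/nodes?v&h=name,heap.max,heap.percent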

But this is about shard configuration; for information about kNN you need to check with the OpenSearch community.

Yeah, I'll do that.

However, is the increase in latency with an increase in the number of clients generating the load expected behavior?

Yes, I would say so. As you increase the load on the cluster you are at some point going to hit bottlenecks and limitations.

It also depends on how you run the benchmark as there could be limitations or bottlenecks on the load generating side as well.

By client bottleneck, do you mean it is not able to generate the load?

| num_clients | target-throughput (ops/sec) | Latency p99 (p99.9), ms | Service time p99 (p99.9), ms |
|---|---|---|---|
| 35 | 100 | 119.2 (129.823) | 117.295 (127.525) |
| 75 | 100 | 231.84 (249.435) | 228.489 (247.742) |

I can see that in both cases ~99.5 ops/sec is reached.

I am not saying the load driver is the issue here, but it is something to always keep an eye out for.


The machines in my cluster have 32 GB of heap space. I notice that as traffic to my cluster increases, the JVM garbage collection time on the nodes also increases. This is one of the major contributors to the increase in latency.

Quick question: is it normal to see this pattern of garbage collection time increasing with traffic in distributed search systems?

More traffic means more heap needs to be allocated and released in order to handle requests, so I would say it is expected for garbage collection to increase.
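One way to put numbers on this is to compare the per-node GC counters before and after a benchmark run using the node stats API (a sketch; filter_path only trims the response):

    GET _nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors

The collection_count and collection_time_in_millis fields for the young and old collectors show how much GC work each node did.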

Besides what Christian already said, I would check whether your heap is below the compressed oops threshold, as this can also impact performance.

You can check it with the following request:

GET _nodes/node-name/jvm

And look for the field using_compressed_ordinary_object_pointers; it needs to be true.
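If the full JVM info is too noisy, the same field can be pulled directly with response filtering (a sketch; filter_path works with this API as well):

    GET _nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers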

Gotcha. Verified that this field is true.

The interesting observation is that heap usage hovers around 60% during the benchmark, yet the garbage collection count and time increase in line with traffic.

Is this expected?

I have no experience with Opensearch and do not know how it and the associated plugins behave from a performance perspective nor how this affects garbage collection. I would recommend you reach out to the Opensearch community.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.