I have the following cluster:
Nodes: 6 (48 vCPUs and 384 GB memory each)
Shards: 158
EBS volume: 24 TB gp3 (provisioned IOPS: 50,000 and 1781 Mb/sec throughput per node)
Replicas: 0
~11B documents of ~10KB each.
When I benchmark the cluster, varying clients from 1 to 150 and target throughput from 1 to 200, I see CPU utilization under 25%. I looked at the EBS read IOPS and throughput, and both are well under the provisioned limits. I am not able to conclude whether the search requests are CPU- or IO-bound, but latency increases as I increase clients and throughput.
From here, I see that on each node up to 5 shards are searched concurrently.
Is there a limit on the number of search threads in a node that can search one particular shard at the same time?
The reason I am asking is that the number of vCPUs per node (48) is almost 1.8 times the number of shards per node (27). If only one thread per node can search a given shard at a time, then at any given moment only 27 of the 73 search threads can be active, leading to low CPU utilization.
Is the above fair reasoning for why I see low CPU utilization? I am about to increase replicas to 1 to see if that helps, but wanted to understand why this could happen.
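For reference, a rough sketch of how the default search pool size and live thread usage can be checked while the benchmark runs (it assumes the cluster answers on http://localhost:9200 without authentication; adjust the endpoint and auth for a real deployment):

```python
# Rough sketch: compute the default search thread pool size and watch live usage.
# Assumes the cluster answers on http://localhost:9200 without authentication.
import requests

def default_search_pool_size(vcpus: int) -> int:
    # 7.x default for the "search" pool: int((allocated processors * 3) / 2) + 1
    return (vcpus * 3) // 2 + 1

print(default_search_pool_size(48))  # -> 73, the figure used above

# Per-node view of active/queued/rejected search threads while the benchmark runs
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,active,queue,rejected,size"},
)
print(resp.text)
```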
Also, I am not able to reason about why doubling the number of clients generating the same load doubles the service time. From the service's point of view, irrespective of the number of clients generating the same load, it should handle it the same way, right?
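One pattern that would produce exactly this, for what it's worth: if some resource is already saturated, the achieved throughput stays flat regardless of client count, and by Little's Law latency then grows in proportion to the number of in-flight clients. A toy calculation with made-up numbers (not taken from the benchmark):

```python
# Toy Little's Law sketch (all numbers are made up, not from the benchmark):
# if the cluster is saturated at a fixed throughput, adding more in-flight
# clients only adds waiting, so latency grows with the client count.
saturated_throughput = 100.0  # requests/second the cluster can sustain (assumed)

for clients in (25, 50, 100):
    latency = clients / saturated_throughput  # Little's Law: W = N / X
    print(f"{clients} clients -> ~{latency:.2f} s per request")
```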
Shard size after indexing: 216.5 GB. Yes, number_of_replicas is 0.
I run script_score-based kNN queries. No, the search benchmark is run after indexing, with all segments within each shard merged down to 1. Queries target only one index.
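For context, the queries are roughly of this shape, using the k-NN plugin's script scoring (a sketch: the index name, vector field, and query vector are placeholders, and the exact parameters depend on the plugin version):

```python
# Sketch of a script_score-based exact kNN query against the k-NN plugin.
# "my-index", "my_vector", and the query vector are placeholders.
import requests

query = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "my_vector",
                    "query_value": [0.1, 0.2, 0.3],
                    "space_type": "l2",
                },
            },
        }
    },
}

resp = requests.post("http://localhost:9200/my-index/_search", json=query)
print(resp.json())
```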
OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)
OpenSearch is not supported here; it has changes in the code made mostly by AWS.
But considering that OpenSearch uses a fork of Elasticsearch 7.10, you need to follow the recommendations for the number of shards and heap memory for this version, which basically are:
Your Java heap should not be higher than 32 GB; you need to stay below the compressed oops threshold, which on most systems means a maximum Java heap of around 30 GB.
Have a maximum of 20 shards per GB of heap, so with 30 GB of heap you should have a maximum of 600 shards per node.
But this is about shard configuration; for information about kNN you need to check with the OpenSearch community.
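If you want to verify this on your nodes, something along these lines should show the configured heap and whether compressed oops are actually in use (a sketch; it assumes the cluster answers on http://localhost:9200 without authentication):

```python
# Sketch: report each node's max heap and whether compressed oops are in use.
# Assumes http://localhost:9200 without authentication.
import requests

info = requests.get("http://localhost:9200/_nodes/jvm").json()
for node in info["nodes"].values():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024 ** 3
    print(
        f"{node['name']}: heap {heap_gb:.1f} GB, "
        f"compressed oops: {jvm.get('using_compressed_ordinary_object_pointers')}"
    )
```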
The machines in my cluster have 32 GB of heap space. I notice that as traffic to my cluster increases, the JVM garbage collection time on the nodes also increases. This is one of the major contributors to the increase in latency.
Quick question: is it normal to see this pattern of garbage collection time increasing with traffic in distributed search systems?
More traffic means more heap needs to be allocated and released in order to handle requests, so I would say it is expected for garbage collection to increase.
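If it helps, per-node heap usage and GC activity can be sampled during the benchmark and correlated with the load using something like this (a sketch; it assumes http://localhost:9200 without authentication, and the counters are cumulative, so diff successive samples):

```python
# Sketch: sample cumulative GC counts/times and heap usage per node.
# Poll periodically during the benchmark and diff successive samples.
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    jvm = node["jvm"]
    young = jvm["gc"]["collectors"]["young"]
    old = jvm["gc"]["collectors"]["old"]
    print(
        f"{node['name']}: heap {jvm['mem']['heap_used_percent']}%, "
        f"young GC {young['collection_count']}x/{young['collection_time_in_millis']}ms, "
        f"old GC {old['collection_count']}x/{old['collection_time_in_millis']}ms"
    )
```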
The interesting observation is that heap usage hovers around 60% during the benchmark, while the garbage collection count and time increase in line with traffic.
I have no experience with OpenSearch and do not know how it and the associated plugins behave from a performance perspective, nor how this affects garbage collection. I would recommend you reach out to the OpenSearch community.