Esrally benchmarking elastic cluster running on kubernetes on AWS

Hi All,

1 esrally pods
1 master es pod
1 data es pod

All these are running on separate ec2 instances.
I am using i4i.16xLarge instance type.

i4i.16xLarge

64 vCPUs
512 gb ram

BULK SIZE 1000
BULK INDEXING CLIENT 64
NUMBER OF SHARDS 2
refresh_interval -1
track nyc_taxis

I am looking at indexing throughput

I am storing data on memory on data node. No IO bottle neck for IO.
I am looking to saturate CPU usage.

i see the CPU Usage is not going beyond 65% even increasing bulk size further and no gain in throughput as well.

I see no significant gain in throughput with BULK SIZE 2000, 3000, 4000, 5000, 10000.

How is BULK INDEXING CLIENT and Elasticsearch write thread related?

Elasticsearch take the write thread size based on limit or requests set on resources in kubernetes.?

I don't see any bottle neck on elastic cluster side but i see the io hitting 100% on esrally node for half indexing duration. Kindly help

CPU USAGE DATA NODE

CLIENT SIDE IO and NW USAGE

These information are collected with SAR tool. I checked the stats at master and data i don't see any bottle neck.

I only see the bottle neck at rally side i.e IO bottle neck for half indexing duration. The storage for esrally is NVMe with 32000 IOPS. and data side i am using memory as storage. I haven't seen this issue with 4x and 8x instance.

Hi Amit,

By indexing to 2 shards on a single node, you can expect the cluster to have at most 2 concurrent write threads allocated to indexing. With 64 vCPUs, the write thread pool is automatically sized at 64. To saturate the cluster:

  1. Set the number of shards to 64.
  2. Set the number of indexing clients to 2x the number of node CPU cores, 128 in this case.

You should expect a big difference in CPU usage after making these changes.

Thank you,
Jason

Thankyou @json for Reply

Hi Json,

Few queries:-

  1. How Hyperthreading play role in indexing performance? As i see the number of processors is taken as the vCPUs but not the physical core.

  2. The bulk thread pool is sized to the number of vCPUs . With Hyperthreading on each core has two threads enabled and doubling the indexing client will further have 2 concurrent indexing thread mapped to one write thread. did i get it right?

  3. The JVM based getRunTimeProcessor() take the processors count from container. If the resources have only request for cpu defined in yaml the number of processor will be taken as resources.requests.cpu and if limit is defined than number of processor will be taken as resources.limit.cpu. is it correct as per my understanding?

  4. if i have 3 data node, with 1 master and 1 esrally each with 8vCPUs then i will have 8*3= 24 vCPUs for data nodes each with 8 write thread. So in this case How shards will be distributed across each data node if i take 8 shards for each data node in order to saturate cluster. Kindly give some idea on how to approach in case of multi data nodes with number of shards , Bulk indexing client

Hello Amit, I'll answer inline:

  1. How Hyperthreading play role in indexing performance? As i see the number of processors is taken as the vCPUs but not the physical core.

The role hyperthreading plays in indexing performance should be evaluated on a case-by-case basis. You can generally expect a hyperthreaded core not to perform as well as a physical core, an aspect that might only matter if the node is CPU bound.

  1. The bulk thread pool is sized to the number of vCPUs . With Hyperthreading on each core has two threads enabled and doubling the indexing client will further have 2 concurrent indexing thread mapped to one write thread. did i get it right?

Doubling the number of indexing clients is something we do with Rally for saturating the target's write thread pool. ES will use a vCPU, as it is presented by the system, per Java thread. Hyperthreading is only a factor as it relates to whatever performance impact is induced by the hyperthreading feature itself.

  1. The JVM based getRunTimeProcessor() take the processors count from container. If the resources have only request for cpu defined in yaml the number of processor will be taken as resources.requests.cpu and if limit is defined than number of processor will be taken as resources.limit.cpu. is it correct as per my understanding?

The number of processors will be whatever the container sees. Limits will be invoked through kernel cgroups and any time a container reaches its allocated limit, then the process will be forced to wait. We typically refer to this behavior as CPU throttling.

  1. if i have 3 data node, with 1 master and 1 esrally each with 8vCPUs then i will have 8*3= 24 vCPUs for data nodes each with 8 write thread. So in this case How shards will be distributed across each data node if i take 8 shards for each data node in order to saturate cluster. Kindly give some idea on how to approach in case of multi data nodes with number of shards , Bulk indexing client

This is why we benchmark. :smile: Replicas need to factor in for a multi-node cluster. Each replica performs the same indexing functions as their primaries, therefore, it might only make sense for a 12 primary, 1 replica scheme in this scenario. It very much depends on the number of indices and search behavior. With 12 primaries, you may want to keep with 2*#vCPUs given that the cluster may end up placing more than 4 primaries on a single node. The nodes will use the write queue as needed and we want to keep those queues >1 if we are to keep the cluster saturated.

Thank you,
Jason

Hi @json
Thankyou for replying

I tried running with 64 shards and 128 bulk indexing client and 1000 bulk size.

CPU Utilization on data node.

Still i see IO bottle neck on client side i.e esrally.

Q. Does this IO bottle neck on client side for approx half indexing duration has any impact on indexing performance?

with further increase in bulk size to 2000
i get below warning

 [[WARNING] No throughput metrics available for [index-append]. Likely cause: The benchmark ended already during warmup](https://discuss.elastic.co/t/warning-no-throughput-metrics-available-for-index-append-likely-cause-the-benchmark-ended-already-during-warmup/306935)

as i feel indexing finishes before warm up tiime.

Any further sugesstion?

Hi @json
Any update on above queries?
Thanks

Hi Amit,

I wonder if what you are seeing on the client side is the data generation and warm-up phase of the benchmark, though there is some overlap. You want to see a saturated write thread pool cluster side. With all the storage in memory, indexing is going to be very fast. You might even need to go beyond a 10000 bulk size along with expanding your dataset in order to collect metrics.

Hi @json

Thankyou for Replying.

I am able to saturate
4xLarge with 1000 bulk size.
8xLarge with 2000 bulk size.
16xLarge :- not able to saturate as i move beyond 1400 BS i get below warning.
[[WARNING] No throughput metrics available for [index-append]. Likely cause: The benchmark ended already during warmup]

Reason:- Indexing finishes before warmup time.

Q1. Does the IO bottle neck on client side for approx. half indexing duration has any impact on indexing performance?
There is some overlap for indexing duration as well. Does this have any impact on indexing performance. As client side i have NVMe storage.

Q2. for 16xlarge I am getting no metrics warning with 0 error rate, How to overcome no metric warning?
I went through the previous issues reported. I figured out people suggesting to make warmup time 0. Does this have impact on performance result? Any other work around for this issue.