Gatling test on ES EKS cluster

INS · October 5, 2024, 3:33pm

Does anyone have experience running Gatling tests against Elasticsearch? I'm mainly interested in query response times. I have a cluster built on 10 data nodes (14 CPUs each for ES, 52 GB RAM) and 3 master nodes (6 CPUs each). During the test I didn't reach the expected response time at 600 rps, or even at 400 rps. The CPUs were saturated at over 100%. My shard size is ~10 GB, plus 1 replica, so this data should fit in the heap. I don't really understand why ES couldn't load that much data into memory. Yes, I know about Rally (we have no trouble with indexing; our focus is on search response time, especially as it affects end users). By response time I mean the 99.999th percentile.

We couldn't observe any overload on heap memory. The tests also include queries that use aggregations over nested fields (yes, I know this is a heavy operation), but on machines like these it should run smoothly. We have very small shards on 9 data nodes (this is a small index with 1 replica). The network doesn't show any load either. We track all of these metrics in Grafana. I have no idea what else we can do to increase performance through hardware. We're using premium SSD disks on Azure.

INS · October 5, 2024, 3:33pm

Added vector-search

Christian_Dahlqvist · October 5, 2024, 3:42pm

Elasticsearch does not cache all data on the heap, so increasing the heap size does not necessarily improve performance. It instead relies on the operating system page cache. I would recommend looking at your heap usage and trying to gradually decrease the heap size so there is more space for the page cache. Ideally you should see no disk I/O when querying. This is the best way to optimise for handling a large number of concurrent queries.

Start low and gradually increase the number of concurrent queries. Plot query latency as a function of the number of concurrent queries so you can see how many concurrent queries your current cluster can handle at an acceptable query latency.
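
A minimal sketch of such a ramp, assuming a Python harness rather than Gatling. `run_query` is a hypothetical stand-in that only sleeps; it would need to be replaced with a real search request against the cluster. The concurrency levels and request counts are illustrative, not recommendations.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_query():
    """Hypothetical stand-in for one search request; replace with a real client call."""
    time.sleep(random.uniform(0.001, 0.003))


def timed_call(query_fn):
    """Return the wall-clock duration of a single query in seconds."""
    start = time.perf_counter()
    query_fn()
    return time.perf_counter() - start


def measure_step(query_fn, concurrency, requests_per_worker=5):
    """Run concurrency * requests_per_worker queries with a fixed number of workers."""
    total = concurrency * requests_per_worker
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: timed_call(query_fn), range(total)))
    # Nearest-rank style p99; with few samples this is simply the slowest request.
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "concurrency": concurrency,
        "median": statistics.median(latencies),
        "p99": latencies[p99_index],
    }


def ramp(query_fn, levels=(1, 2, 4, 8)):
    """Start low and step up; one result row per concurrency level."""
    return [measure_step(query_fn, level) for level in levels]


if __name__ == "__main__":
    for row in ramp(run_query):
        print(
            f"{row['concurrency']:>3} concurrent  "
            f"median={row['median'] * 1000:.1f} ms  p99={row['p99'] * 1000:.1f} ms"
        )
```

Plotting `p99` against `concurrency` from these rows shows where latency starts to climb, which is the point at which the cluster stops keeping up.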

INS · October 5, 2024, 4:50pm

Hi Christian,
Thanks for the tips.
Regarding "try to gradually decrease it so there is more space for the page cache": if I decrease the memory, won't the ES cluster consume much more I/O? As you can see in the screenshot, we didn't reach the max heap size, but the active thread pool and CPU are at their limits (in our case the thread pool size should be 22 = (14 CPU * 3) / 2 + 1 on each data node).

Christian_Dahlqvist · October 5, 2024, 5:29pm

I am not sure I understand what you are saying. It looks like you have more heap allocated than may be necessary, so if you are seeing lots of disk I/O (I do not see that in the graphs you provided) I would try to reduce the heap size and/or add more RAM to the node.

When you increase query throughput it is clear that you are limited by CPU. To try and address that you may want to look into how you shard your data as well as how you query it so you run as efficient queries as possible.
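
The figure of 22 search threads quoted earlier in the thread matches the default sizing formula for the `search` thread pool in the Elasticsearch documentation, `int((allocated processors * 3) / 2) + 1`. A small sketch of that arithmetic (the function name is made up for illustration):

```python
def search_thread_pool_size(allocated_processors: int) -> int:
    """Default size of the search thread pool for a given processor count."""
    return (allocated_processors * 3) // 2 + 1


if __name__ == "__main__":
    print(search_thread_pool_size(14))  # 22, as computed for the 14-CPU data nodes
```

Since the pool is sized from the CPU count, a fully active search pool together with saturated CPUs is consistent with the node being CPU-bound rather than memory-bound.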

BenTrent · October 7, 2024, 11:35am

To second what @Christian_Dahlqvist is saying.

First, make sure you are not seeing a large number of major page faults. These indicate that ES is having to page in memory off-heap, which possibly means that you don't quite have enough off-heap memory. The two options there are to increase node memory or to reduce the JVM heap allocation to increase the off-heap size.
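
One way to check this, sketched under the assumption that the nodes run Linux and that you can read `/proc` for the Elasticsearch process (inside the pod or container on Kubernetes). Per the `proc(5)` man page, field 12 of `/proc/<pid>/stat` (`majflt`) is the cumulative count of major page faults; the function names here are illustrative only.

```python
def parse_major_faults(stat_line: str) -> int:
    """Extract majflt (field 12) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split after the last ')' and count from field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[9])  # field 12 is index 9 when counting from field 3


def read_major_faults(pid: int) -> int:
    """Read the cumulative major page fault count for a running process."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_major_faults(stat_file.read())
```

Sampling `read_major_faults` before and after a load test and comparing the two values gives the number of major faults incurred during the run.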

Second, if you do not see a large number of major page faults, it would be interesting to know the KIND of queries you are running. For many types of queries, the rule of thumb is: the fewer the segments, the faster the queries. One possible option is to adjust the index merging configuration to allow segments to merge more aggressively. Having one segment per CPU core available on the machine might be the optimal number in your case.
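
A hedged sketch of what the two approaches could look like as REST requests, built here as plain data without sending anything. It assumes the expert-level index setting `index.merge.policy.segments_per_tier` (default 10, to my knowledge dynamically updatable) and the force merge API with its `max_num_segments` parameter. The index name `my-index` and the values used are purely illustrative; check the settings against the documentation for your Elasticsearch version before applying them.

```python
import json


def merge_settings_request(index: str, segments_per_tier: int):
    """Build a settings update that lowers the allowed segments per tier.

    A lower value makes the merge policy merge more aggressively, at the
    cost of more merge I/O and CPU while indexing.
    """
    body = {"index": {"merge": {"policy": {"segments_per_tier": segments_per_tier}}}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


def force_merge_request(index: str, max_num_segments: int):
    """Build a one-off force merge down to at most max_num_segments per shard."""
    return "POST", f"/{index}/_forcemerge?max_num_segments={max_num_segments}", None


if __name__ == "__main__":
    # Illustrative values only: 5 segments per tier, force merge to 14 segments
    # (one per CPU core on a 14-CPU data node).
    for method, path, body in (
        merge_settings_request("my-index", 5),
        force_merge_request("my-index", 14),
    ):
        print(method, path, body or "")
```

The settings change influences ongoing background merges, while the force merge is a one-off operation. The Elasticsearch documentation recommends force merging only indices that are no longer being written to.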

INS · October 8, 2024, 8:46am

Yes, manually force-merging segments brings better performance, but can I control the max number of segments per shard through index settings on a live index?

INS · October 9, 2024, 11:13am

@BenTrent How do I implement a more aggressive merge configuration for segments?