Reasoning behind Geonames Rally Design

Hi,

I am currently evaluating Elasticsearch for an HPC-related data lake infrastructure. For that, we are using the Rally benchmarker, in particular the geonames and nyc_taxis tracks.

Since our HPC environment is dominated by batch jobs, we care more about throughput scalability than about latency, as latency is very unlikely to become a bottleneck in compute-oriented jobs.

While analyzing the geonames schedule, I realized that most tasks seem to be throttled by a target-throughput setting.
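For context, the schedule entries I mean look roughly like this (the operation name and numbers here are only illustrative, not copied from the actual track file):

```json
{
  "operation": "default",
  "clients": 8,
  "warmup-iterations": 500,
  "iterations": 500,
  "target-throughput": 50
}
```

As far as I understand the docs, target-throughput caps the task at that many operations per second across all clients combined.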

I would love to learn about the reasoning behind it. Especially:

  1. Why did you limit the throughput? How does it improve the result data?
  2. If I were to create a custom track that removes those limitations, what would I have to consider when analyzing the results?

It really surprised me, since I am not that knowledgeable about benchmarks.

Thank you so much
Lars

PS: If it matters, we use pretty big servers for our ES cluster benchmarks (12-core Xeon, 500 GB RAM per node).

Update 2: Especially since I saw that nyc_taxis runs both default_no_target and default, there has to be a reason I don't understand!

Hello,

Good question.

Those tracks are built for our Elasticsearch benchmarks page and are designed to help us find performance regressions in Elasticsearch itself, so we care more about stability than about absolute numbers or scalability. That said:

  • For indexing, you'll notice that we never throttle and go as fast as possible.
  • For querying, we are more interested in latency than in throughput, and throttling gives us more stable latencies. Regarding default_no_target: we ran some experiments in the nyc_taxis track to see the effect of disabling throttling, but they were not conclusive and we paused that work (see the sketch after this list).
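To make the default_no_target comparison concrete, the two variants boil down to something like the following pair of tasks; the names mirror the ones in the track, but the numbers are illustrative rather than copied from it. The only difference is whether target-throughput is set:

```json
[
  {
    "name": "default",
    "operation": "default",
    "clients": 8,
    "warmup-iterations": 50,
    "iterations": 100,
    "target-throughput": 3
  },
  {
    "name": "default_no_target",
    "operation": "default",
    "clients": 8,
    "warmup-iterations": 50,
    "iterations": 100
  }
]
```

One thing to keep in mind if you drop target-throughput in your own track: Rally reports both latency and service_time. With a throughput target, latency also includes the time a request spends waiting for its scheduled slot, which is how we spot saturation; without a target, latency and service_time are essentially the same, so the latency numbers of throttled and unthrottled runs are not directly comparable.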

Since our HPC environment is dominated by batch jobs, we care more about throughput scalability than about latency, as latency is very unlikely to become a bottleneck in compute-oriented jobs.

Do you care about query throughput, or mostly about indexing throughput, like most of our users? If you care more about indexing, then our benchmarks could be a good starting point. Otherwise, you may need to tweak them.

And, as always, the best benchmark is one based on your own data and your own latency/throughput requirements. Rally has good docs for creating your own track.
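If you go that route, a minimal custom track skeleton looks roughly like this; all names, file names, and counts below are placeholders for your own data, and the Rally track reference documents the full set of fields:

```json
{
  "description": "Custom data lake benchmark (placeholder)",
  "indices": [
    {
      "name": "logs",
      "body": "index.json"
    }
  ],
  "corpora": [
    {
      "name": "logs-corpus",
      "documents": [
        {
          "source-file": "documents.json.bz2",
          "document-count": 1000000,
          "uncompressed-bytes": 500000000
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5000
      },
      "clients": 8
    },
    {
      "operation": {
        "name": "match-all",
        "operation-type": "search",
        "body": {
          "query": {
            "match_all": {}
          }
        }
      },
      "clients": 4,
      "warmup-iterations": 200,
      "iterations": 1000
    }
  ]
}
```

Since the search task above has no target-throughput, it runs unthrottled, which matches your throughput-oriented use case; add target-throughput back if you also want stable latency numbers.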

