I am currently evaluating Elasticsearch for an HPC-related data lake infrastructure. For that, we are currently using the Rally benchmarking tool, in particular with the geonames and nyc_taxis tracks.
Since our HPC environment is dominated by batch jobs, we care more about throughput scalability than actual latency, as latency is very unlikely to become a bottleneck in computing-oriented jobs.
Those tracks are designed for our Elasticsearch benchmarks page, and their goal is to help us find performance regressions in Elasticsearch itself. So we care more about stability than absolute numbers or scalability. That said:
For indexing, you'll notice that we never throttle and go as fast as possible.
For querying, we are more interested in latency than throughput, and throttling gives us more stable latencies. Regarding default_no_target, we ran some experiments in the nyc_taxis track to see the effect of disabling throttling, but they were not conclusive and we paused that work.
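To make that concrete, here is a minimal sketch of how throttling shows up in a Rally challenge schedule. The operation names, client counts, and rates below are placeholders rather than the exact values from the geonames or nyc_taxis track files; the key point is that a task without target-throughput runs unthrottled, while target-throughput caps a query task at a fixed rate:

```json
{
  "schedule": [
    {
      "operation": "index-append",
      "warmup-time-period": 120,
      "clients": 8
    },
    {
      "operation": "default",
      "clients": 1,
      "warmup-iterations": 500,
      "iterations": 1000,
      "target-throughput": 100
    }
  ]
}
```

The indexing task has no target-throughput, so the clients index as fast as they can; the query task is capped at 100 operations per second to keep latency measurements stable. Removing target-throughput from the query task is essentially what the default_no_target experiments were testing.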
Do you care about query throughput, or mostly about indexing throughput like most of our users? If you care more about indexing, then our benchmarks could be a good starting point. Otherwise, you may need to tweak them.
And, as always, the best benchmark is a benchmark based on your own data and your latency/throughput requirements. Rally has good docs for creating your own track.
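For illustration, a custom track can be as small as a single JSON file. The sketch below follows the structure shown in the Rally documentation; the index name, file names, document counts, and sizes are placeholders you would replace with your own data:

```json
{
  "version": 2,
  "description": "Minimal custom track sketch with placeholder names and sizes",
  "indices": [
    { "name": "hpc-results", "body": "index.json" }
  ],
  "corpora": [
    {
      "name": "hpc-results",
      "documents": [
        {
          "source-file": "documents.json.bz2",
          "document-count": 1000000,
          "uncompressed-bytes": 500000000
        }
      ]
    }
  ],
  "schedule": [
    { "operation": { "operation-type": "delete-index" } },
    { "operation": { "operation-type": "create-index" } },
    {
      "operation": { "operation-type": "bulk", "bulk-size": 5000 },
      "clients": 8
    },
    {
      "operation": {
        "operation-type": "search",
        "body": { "query": { "match_all": {} } }
      },
      "clients": 1,
      "warmup-iterations": 100,
      "iterations": 500,
      "target-throughput": 50
    }
  ]
}
```

You can then point Rally at the directory containing this file (via --track-path) and adjust clients, bulk sizes, and target-throughput to match your batch-oriented workload.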