It appears that Rally settings for "iterations" and "bulk-size" interact in a way that is not documented:
On the geonames track, change the schedule for the append-no-conflicts-index-only challenge by:
- deleting "warmup-time-period" (allowing it to default to 0 "warmup-iterations"), and
- changing clients from 8 to 1.
Then run with the command-line options --keep-cluster-running --preserve-install=true and view the number of documents indexed by querying the indices.
The challenge finishes very quickly, and the index contains only 5,000 or so documents.
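For reference, one way to see the indexed document count after the run is the cat indices API (assuming the cluster is still reachable on the default HTTP port):
curl 'localhost:9200/_cat/indices?v'   # check the docs.count column; assumes default host/port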
Why is this?
What is the best way to iterate N times over all the records in a data file?
It's maybe not ideally described in the documentation, but doing it this way implicitly means that you are running just one iteration (i.e. one bulk request).
One feature that you can use is called laps: with --laps=10, Rally will run the same benchmark 10 times but without restarting the benchmark candidate. Please do note, though, that as the first step in the benchmark Rally will (implicitly) delete the indices that are used in that benchmark so that you have reproducible conditions.
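For example, an invocation along these lines (using the track and challenge from your run):
esrally --track=geonames --challenge=append-no-conflicts-index-only --laps=10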
As background, we have a data file of representative documents that we want to scale up for performance testing.
So say my data file contains 1000 documents and I configure a bulk size of 50 with 1 client: I'd have to run 1000 / 50 = 20 iterations and 1 lap to index that data file completely one time, right?
If I ran 40 iterations, would it run through and index the data file a second time? Would that be any different from 20 iterations with 2 laps?
Does anything change if I use more than 1 client? That is, does running N iterations issue N bulk requests independently of the number of clients available to submit the bulk requests?
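In track terms, I picture something roughly like this (just a sketch that borrows the operation layout from the standard tracks, not a track I have actually run):
"operations": [
  {
    "name": "index-append",
    "operation-type": "index",
    "bulk-size": 50
  }
],
...
"schedule": [
  {
    "operation": "index-append",
    "warmup-iterations": 0,
    "iterations": 20,
    "clients": 1
  }
]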
This is correct, but not the line of thought that you should go down. The usual case is that you let Rally read the (whole) file by just specifying a warmup time period and letting Rally figure out the rest. Iterations are possible for bulk operations but are somewhat unusual. An iteration in that context basically means one execution of one bulk request, as Rally's internal file reader always advances (i.e. it reads one bulk's worth of lines from the file, for the next call it reads the next bulk's worth of lines, and so on).
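For comparison, the usual schedule entry for such a bulk operation just specifies a warmup time period and lets Rally read until the data file is exhausted, roughly like this (the exact values in the shipped geonames track may differ):
"schedule": [
  {
    "operation": "index-append",
    "warmup-time-period": 120,
    "clients": 8
  }
]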
No, it does not loop. It basically either stops after the specified number of iterations or when there is no data left (not 100% sure, though, without diving into the code whether that even works when you use iterations). It would also be different in that, if you run multiple laps, Rally does the following:
1. delete the indices affected by the benchmark
2. bulk index the data
3. start at 1. again as long as there are remaining laps
So you index the same data set over and over again, but your index size does not grow. Also, if you index the same data again, Lucene (which is under the hood of Elasticsearch) will notice and can store the data more densely than you would experience in your production workload.
I think the best you can do at the moment is to generate the largest data file that you have in mind and then really use iterations instead of the warmup-time-period parameter. This would allow you to control how many bulks you run. You could use Rally's templating support to generate multiple challenges for that on the fly, so you do not need to fiddle with the track.json file all the time.
The following is totally untested but something along those lines should work:
{% set doc_count = 10000000 %}
{% set bulk_size = 5000 %}
{
"short-description": "Demo benchmark",
"description": "Demo benchmark with with different doc sizes",
"indices": [
...
],
"operations": [
{
"name": "index-append",
"operation-type": "index",
"bulk-size": {{bulk_size}}
}
],
"challenges": [
{% set comma = joiner() %}
{# range end is doc_count + 1 so the challenge for doc_count itself is also generated #}
{% for count_docs_slice in range(1000000, doc_count + 1, 1000000) %}
{# integer division so iteration counts and challenge names render as whole numbers #}
{% set total_bulks = count_docs_slice // bulk_size %}
{% set warmup_bulks = total_bulks // 10 %}
{% set measurement_bulks = total_bulks - warmup_bulks %}
{{ comma() }}
{
"name": "index-{{total_bulks}}",
"description": "Indexes {{count_docs_slice}} in {{total_bulks}} of {{bulk_size}} docs.",
"index-settings": {
"index.number_of_replicas": 0
},
"schedule": [
{
"operation": "index-append",
"warmup-iterations": {{warmup_bulks}},
"iterations": {{measurement_bulks}}
"clients": 8
}
]
}
{% endfor %}
]
}
It uses the templating support to create challenges for the first 1 million, 2 million, ..., 10 million docs, and you could run this with --track=your-track-name and --challenge=index-200, --challenge=index-400 and so on. With a little bit of work I am sure this can be improved further, but I hope it is a starting point.