Hi,
There are no results for the following three metrics:
...
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
...
I bet you have a 100% error rate during indexing and that's why you see no throughput numbers. Can you please upgrade to Rally 0.5.1 (I just released it), which now shows error rates? For example, here you can see a zero error rate:
| All | Min Throughput | index-append | 661.248 | docs/s |
| All | Median Throughput | index-append | 661.248 | docs/s |
| All | Max Throughput | index-append | 661.248 | docs/s |
| All | error rate | index-append | 0 | % |
If you see any number > 0%, you ran into errors during indexing and need to inspect the log files of Elasticsearch (and maybe also Rally) to see what happened.
I tried with 0.5.1; here is the report. The error rate is 0, but some metrics show negative numbers. Is that normal?
| Lap | Metric | Operation | Value | Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | -1.24765 | min |
| All | Merge time | | -6.60647 | min |
| All | Refresh time | | -0.865483 | min |
| All | Flush time | | -0.1632 | min |
| All | Merge throttle time | | -0.582767 | min |
| All | Total Young Gen GC | | 6.092 | s |
| All | Total Old Gen GC | | 0.117 | s |
| All | Heap used for segments | | 717.885 | MB |
| All | Heap used for doc values | | 5.59931 | MB |
| All | Heap used for terms | | 522.604 | MB |
| All | Heap used for norms | | 0.268616 | MB |
| All | Heap used for points | | 25.2797 | MB |
| All | Heap used for stored fields | | 164.133 | MB |
| All | Segment count | | 3100 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.0790135 | ops/s |
| All | Median Throughput | force-merge | 0.0790135 | ops/s |
| All | Max Throughput | force-merge | 0.0790135 | ops/s |
| All | 100th percentile latency | force-merge | 12656 | ms |
| All | 100th percentile service time | force-merge | 12656 | ms |
| All | error rate | force-merge | 0 | % |
No, this is completely unexpected, and basically all of these metrics are weird:
- Negative times are completely unexpected.
- Several metrics are missing (e.g. CPU usage, index size, ...). Did you run with `--pipeline=benchmark-only` instead?
- There is no way the heap usage metrics should be that high during this benchmark (compare the ~160 MB reported for stored fields vs. 0.6 MB in my results below).
- The segment count went through the roof (3100).
We have been running these benchmarks every day for more than a year and I've never seen such metrics (so I am very interested in what happened here).
I just ran `esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0` (Rally 0.5.1) on my machine and here are my results:
| Lap | Metric | Operation | Value | Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | 30.5668 | min |
| All | Merge time | | 5.55988 | min |
| All | Refresh time | | 0.652883 | min |
| All | Merge throttle time | | 0.917367 | min |
| All | Median CPU usage | | 674.4 | % |
| All | Total Young Gen GC | | 14.687 | s |
| All | Total Old Gen GC | | 2.509 | s |
| All | Index size | | 2.59237 | GB |
| All | Totally written | | 10.7294 | GB |
| All | Heap used for segments | | 14.55 | MB |
| All | Heap used for doc values | | 0.128994 | MB |
| All | Heap used for terms | | 13.4386 | MB |
| All | Heap used for norms | | 0.0874634 | MB |
| All | Heap used for points | | 0.24948 | MB |
| All | Heap used for stored fields | | 0.64547 | MB |
| All | Segment count | | 113 | |
| All | Min Throughput | index-append | 35962.8 | docs/s |
| All | Median Throughput | index-append | 36598.4 | docs/s |
| All | Max Throughput | index-append | 36871 | docs/s |
| All | 50.0th percentile latency | index-append | 936.697 | ms |
| All | 90.0th percentile latency | index-append | 1553.48 | ms |
| All | 99.0th percentile latency | index-append | 2537.81 | ms |
| All | 100th percentile latency | index-append | 3073.83 | ms |
| All | 50.0th percentile service time | index-append | 936.697 | ms |
| All | 90.0th percentile service time | index-append | 1553.48 | ms |
| All | 99.0th percentile service time | index-append | 2537.81 | ms |
| All | 100th percentile service time | index-append | 3073.83 | ms |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.378016 | ops/s |
| All | Median Throughput | force-merge | 0.378016 | ops/s |
| All | Max Throughput | force-merge | 0.378016 | ops/s |
| All | 100th percentile latency | force-merge | 2645.37 | ms |
| All | 100th percentile service time | force-merge | 2645.37 | ms |
| All | error rate | force-merge | 0 | % |
Can you please share a few things?
- The exact command you ran (I assume it is still `esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0`, but just to be sure).
- The output of `esrally list facts --target-hosts=127.0.0.1`.
Can you please also try to run on another machine?
I did a rerun and the issue (negative numbers) is not reproducible. But there is still no throughput despite a 0% error rate:
| Lap | Metric | Operation | Value | Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | 35.9091 | min |
| All | Merge time | | 9.24398 | min |
| All | Refresh time | | 0.125467 | min |
| All | Merge throttle time | | 1.23803 | min |
| All | Median CPU usage | | 2504.5 | % |
| All | Total Young Gen GC | | 13.846 | s |
| All | Total Old Gen GC | | 2.08 | s |
| All | Index size | | 3.46825 | GB |
| All | Totally written | | 19.0826 | GB |
| All | Heap used for segments | | 22.2556 | MB |
| All | Heap used for doc values | | 0.146637 | MB |
| All | Heap used for terms | | 21.1231 | MB |
| All | Heap used for norms | | 0.103088 | MB |
| All | Heap used for points | | 0.24836 | MB |
| All | Heap used for stored fields | | 0.634415 | MB |
| All | Segment count | | 126 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.070132 | ops/s |
| All | Median Throughput | force-merge | 0.070132 | ops/s |
| All | Max Throughput | force-merge | 0.070132 | ops/s |
| All | 100th percentile latency | force-merge | 14258.8 | ms |
| All | 100th percentile service time | force-merge | 14258.8 | ms |
| All | error rate | force-merge | 0 | % |
Here is the output of `esrally list facts --target-hosts=127.0.0.1`:
{
  "environment": {
    "type": "TODO: Provide one of 'cloud', 'bare-metal' + details about the environment (EC2, instance type)"
  },
  "software": {
    "os_name": "Linux",
    "rally_version": "0.5.1",
    "jvm_version": "1.8.0_65-b17",
    "distribution_version": "TODO: Provide Elasticsearch distribution version",
    "jvm_vendor": "Oracle Corporation",
    "os_version": "3.10.0-327.el7.x86_64"
  },
  "hardware": {
    "disk": [
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-root",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdc",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdb",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-home",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sda1",
        "file-system": "xfs"
      }
    ],
    "memory": "125gb",
    "cpu_model": "Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz"
  }
}
OK, these metrics are much closer to what I'd expect. Based on the kernel version, I guess you are on CentOS 7?
Can you please send me the log files from Rally and Elasticsearch?
Rally's log files are located in `~/.rally/logs` (just send me everything in there) and the log files of Elasticsearch are in `~/.rally/benchmarks/races/`. You need to go to that directory and choose the most recent timestamp. If you ran it just now, it is in a folder with a name that starts with 2017-03-02.
You ran with the `benchmark-only` pipeline, which explains a lot of the weird results we're seeing (e.g. no CPU and I/O data, high heap usage).
I also know now why you don't see any throughput numbers despite a zero error rate. Rally is in a warmup phase for the first two minutes of the benchmark and then switches to normal measurement mode, but the "problem" is that your cluster is so fast that it finishes the benchmark before the two minutes are up. Hence, there are no measurement samples and you don't see any throughput numbers.
You can now do two things:
- Reduce the warmup time period, but that requires changing the track itself (see the sketch after this list). Besides, a two-minute benchmark is worthless anyway. On our infrastructure it takes longer than that, but I may consider adding more data to the geonames track.
- Run a different benchmark. I don't know what you're after, but you should benchmark with a data set that matches your use case. We have a lot more (see https://elasticsearch-benchmarks.elastic.co/). You could, for example, run with the PMC track if you're interested in the performance of full text search (just specify `--track=pmc`).
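For illustration, here is a minimal sketch of where `warmup-time-period` (in seconds) sits in a track. The surrounding structure (`challenges`, `schedule`, `clients`) is an assumption based on the Rally 0.5.x-era track format and may differ in your version; only the `warmup-time-period` property itself is what's discussed in this thread:

```json
{
  "challenges": [
    {
      "name": "append-fast-no-conflicts",
      "schedule": [
        {
          "operation": "index-append",
          "clients": 8,
          "warmup-time-period": 120
        }
      ]
    }
  ]
}
```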
I also recommend that you bring the cluster to a defined state if you have to use the benchmark-only pipeline, for example by deleting the indices that the benchmark creates and restarting Elasticsearch before each race.
Thanks a lot for your investigation. I'm trying your suggestions.
Could you tell me what Rally does during the warmup time?
Can I simply edit the track.json and change the warmup-time-period to a smaller number? I tried 100 and it works, but at what cost?
Every system needs some time after startup to stabilize. Otherwise you are measuring all the noise that occurs during startup. So the "cost" is that you will get numbers from a system that is not yet stable, and you should not trust them.
Here is a visualization of the throughput (from another benchmark):
[chart: throughput over time; warmup samples in blue, measurement samples in red]
Just to be clear: Rally takes samples starting from the first second of the benchmark. However, it labels them differently: the blue ones above are warmup samples and the red ones are the actual measurement samples. Only the red ones show up in the summary report on the command line. You can also see how unstable the throughput numbers are in the beginning, and that is the reason why Rally has a warmup phase.
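If you write Rally's metrics to a dedicated Elasticsearch metrics store, you can see this labeling directly: each throughput sample is stored as a document with a sample-type field. A hypothetical, abbreviated example follows; field names other than `sample-type` (and its `warmup`/`normal` values) are illustrative assumptions and may differ by Rally version:

```json
{
  "@timestamp": 1488445200000,
  "name": "throughput",
  "operation": "index-append",
  "sample-type": "warmup",
  "value": 31250.0,
  "unit": "docs/s"
}
```

Only samples labeled `normal` feed into the throughput numbers of the summary report.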