Hi,
There are no results for the following three metrics:
...
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
...
I bet you have a 100% error rate during indexing and that's why you see no throughput numbers. Can you please upgrade to Rally 0.5.1 (I just released it), which now shows error rates? For example, here you can see a zero error rate:
| All | Min Throughput | index-append | 661.248 | docs/s |
| All | Median Throughput | index-append | 661.248 | docs/s |
| All | Max Throughput | index-append | 661.248 | docs/s |
| All | error rate | index-append | 0 | % |
If you see any number > 0%, you ran into errors during indexing and need to inspect the log files of Elasticsearch (and maybe also Rally) to see what happened.
I tried with 0.5.1; here is the report. The error rate is 0, but some metrics show negative numbers. Is that normal?
| Lap | Metric | Operation | Value | Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | -1.24765 | min |
| All | Merge time | | -6.60647 | min |
| All | Refresh time | | -0.865483 | min |
| All | Flush time | | -0.1632 | min |
| All | Merge throttle time | | -0.582767 | min |
| All | Total Young Gen GC | | 6.092 | s |
| All | Total Old Gen GC | | 0.117 | s |
| All | Heap used for segments | | 717.885 | MB |
| All | Heap used for doc values | | 5.59931 | MB |
| All | Heap used for terms | | 522.604 | MB |
| All | Heap used for norms | | 0.268616 | MB |
| All | Heap used for points | | 25.2797 | MB |
| All | Heap used for stored fields | | 164.133 | MB |
| All | Segment count | | 3100 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.0790135 | ops/s |
| All | Median Throughput | force-merge | 0.0790135 | ops/s |
| All | Max Throughput | force-merge | 0.0790135 | ops/s |
| All | 100th percentile latency | force-merge | 12656 | ms |
| All | 100th percentile service time | force-merge | 12656 | ms |
| All | error rate | force-merge | 0 | % |
No, this is completely unexpected, and basically all of these metrics are weird:
- Negative times are completely unexpected.
- Several metrics are missing (e.g. CPU usage, index size, ...). Did you run with `--pipeline=benchmark-only` instead?
- There is no way the heap usage metrics should be that high during this benchmark (compare the ~160 MB reported for stored fields vs. 0.6 MB in my results below).
- The segment count went through the roof (3100).
We have been running these benchmarks every day for more than a year and I've never seen such metrics (so I am very interested in what happened here).
I just ran `esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0` (Rally 0.5.1) on my machine and here are my results:
| Lap | Metric | Operation | Value | Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | 30.5668 | min |
| All | Merge time | | 5.55988 | min |
| All | Refresh time | | 0.652883 | min |
| All | Merge throttle time | | 0.917367 | min |
| All | Median CPU usage | | 674.4 | % |
| All | Total Young Gen GC | | 14.687 | s |
| All | Total Old Gen GC | | 2.509 | s |
| All | Index size | | 2.59237 | GB |
| All | Totally written | | 10.7294 | GB |
| All | Heap used for segments | | 14.55 | MB |
| All | Heap used for doc values | | 0.128994 | MB |
| All | Heap used for terms | | 13.4386 | MB |
| All | Heap used for norms | | 0.0874634 | MB |
| All | Heap used for points | | 0.24948 | MB |
| All | Heap used for stored fields | | 0.64547 | MB |
| All | Segment count | | 113 | |
| All | Min Throughput | index-append | 35962.8 | docs/s |
| All | Median Throughput | index-append | 36598.4 | docs/s |
| All | Max Throughput | index-append | 36871 | docs/s |
| All | 50.0th percentile latency | index-append | 936.697 | ms |
| All | 90.0th percentile latency | index-append | 1553.48 | ms |
| All | 99.0th percentile latency | index-append | 2537.81 | ms |
| All | 100th percentile latency | index-append | 3073.83 | ms |
| All | 50.0th percentile service time | index-append | 936.697 | ms |
| All | 90.0th percentile service time | index-append | 1553.48 | ms |
| All | 99.0th percentile service time | index-append | 2537.81 | ms |
| All | 100th percentile service time | index-append | 3073.83 | ms |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.378016 | ops/s |
| All | Median Throughput | force-merge | 0.378016 | ops/s |
| All | Max Throughput | force-merge | 0.378016 | ops/s |
| All | 100th percentile latency | force-merge | 2645.37 | ms |
| All | 100th percentile service time | force-merge | 2645.37 | ms |
| All | error rate | force-merge | 0 | % |
Can you please share a few things?
- The exact command you ran (I assume it is still `esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0`, but just to be sure).
- The output of `esrally list facts --target-hosts=127.0.0.1`.
Can you please also try to run on another machine?
I did a rerun and the issue (negative numbers) is not reproducible. But there is still no throughput despite a 0% error rate:
| Lap | Metric | Operation | Value | Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | 35.9091 | min |
| All | Merge time | | 9.24398 | min |
| All | Refresh time | | 0.125467 | min |
| All | Merge throttle time | | 1.23803 | min |
| All | Median CPU usage | | 2504.5 | % |
| All | Total Young Gen GC | | 13.846 | s |
| All | Total Old Gen GC | | 2.08 | s |
| All | Index size | | 3.46825 | GB |
| All | Totally written | | 19.0826 | GB |
| All | Heap used for segments | | 22.2556 | MB |
| All | Heap used for doc values | | 0.146637 | MB |
| All | Heap used for terms | | 21.1231 | MB |
| All | Heap used for norms | | 0.103088 | MB |
| All | Heap used for points | | 0.24836 | MB |
| All | Heap used for stored fields | | 0.634415 | MB |
| All | Segment count | | 126 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.070132 | ops/s |
| All | Median Throughput | force-merge | 0.070132 | ops/s |
| All | Max Throughput | force-merge | 0.070132 | ops/s |
| All | 100th percentile latency | force-merge | 14258.8 | ms |
| All | 100th percentile service time | force-merge | 14258.8 | ms |
| All | error rate | force-merge | 0 | % |
Here is the output of `esrally list facts --target-hosts=127.0.0.1`:
{
  "environment": {
    "type": "TODO: Provide one of 'cloud', 'bare-metal' + details about the environment (EC2, instance type)"
  },
  "software": {
    "os_name": "Linux",
    "rally_version": "0.5.1",
    "jvm_version": "1.8.0_65-b17",
    "distribution_version": "TODO: Provide Elasticsearch distribution version",
    "jvm_vendor": "Oracle Corporation",
    "os_version": "3.10.0-327.el7.x86_64"
  },
  "hardware": {
    "disk": [
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-root",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdc",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdb",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-home",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sda1",
        "file-system": "xfs"
      }
    ],
    "memory": "125gb",
    "cpu_model": "Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz"
  }
}
OK, these metrics are much closer to what I'd expect. Based on the kernel version, I guess you are on CentOS 7?
Can you please send me the log files from Rally and Elasticsearch?
Rally's log files are located in `~/.rally/logs` (just send me everything in there) and the log files of Elasticsearch are in `~/.rally/benchmarks/races/`. You need to go to that directory and choose the most recent timestamp. If you ran it just now, it is in a folder with a name that starts with 2017-03-02.
You ran with the `benchmark-only` pipeline, which explains a lot of the weird results we're seeing (e.g. no CPU and I/O data, high heap usage).
I also know now why you don't see any throughput numbers despite a zero error rate. Rally is in a warmup phase for the first two minutes of the benchmark and then switches to normal measurement mode, but the "problem" is that your cluster is so fast that it finishes the benchmark before the two minutes are up. Hence, there are no measurement samples and you don't see any throughput numbers.
You can now do two things:
- Reduce the warmup time period, but that requires changing the track itself (see the sketch after this list). Besides, a two-minute benchmark is worthless anyway. On our infrastructure it takes longer than that, but I may consider adding more data to the geonames track.
- Run a different benchmark. I don't know what you're after, but you should benchmark with a data set that matches your use case. We have a lot more (see https://elasticsearch-benchmarks.elastic.co/). You could, for example, run with the PMC track if you're interested in the performance of full text search (just specify `--track=pmc`).
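For illustration, here is a minimal sketch of where `warmup-time-period` (in seconds) sits in a track. The surrounding structure (`challenges`, `schedule`, `clients`) is an assumption based on the Rally 0.5.x-era track format and may differ in your version; only the `warmup-time-period` property itself is what's discussed in this thread:

```json
{
  "challenges": [
    {
      "name": "append-fast-no-conflicts",
      "schedule": [
        {
          "operation": "index-append",
          "clients": 8,
          "warmup-time-period": 120
        }
      ]
    }
  ]
}
```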
I also recommend that you bring the cluster to a defined state if you have to use the benchmark-only pipeline, for example by deleting the indices that the benchmark creates and restarting Elasticsearch before each race.
Thanks a lot for your investigation. I'm trying your suggestions.
Could you tell me what Rally does during the warmup time?
Can I simply edit the track.json and change the warmup-time-period to a smaller number? I tried 100 and it works, but at what cost?
Every system needs some time after startup to stabilize. Otherwise you are measuring all the noise that occurs during startup. So the "cost" is that you will get numbers from a system that is not yet stable, and you should not trust them.
Here is a visualization of the throughput (from another benchmark):
[chart: throughput over time; warmup samples in blue, measurement samples in red]
Just to be clear: Rally takes samples starting from the first second of the benchmark. However, it labels them differently: the blue ones above are warmup samples and the red ones are the actual measurement samples. Only the red ones show up in the summary report on the command line. You can also see how unstable the throughput numbers are in the beginning, and that is the reason why Rally has a warmup phase.
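If you write Rally's metrics to a dedicated Elasticsearch metrics store, you can see this labeling directly: each throughput sample is stored as a document with a sample-type field. A hypothetical, abbreviated example follows; field names other than `sample-type` (and its `warmup`/`normal` values) are illustrative assumptions and may differ by Rally version:

```json
{
  "@timestamp": 1488445200000,
  "name": "throughput",
  "operation": "index-append",
  "sample-type": "warmup",
  "value": 31250.0,
  "unit": "docs/s"
}
```

Only samples labeled `normal` feed into the throughput numbers of the summary report.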