No Throughput result in the summary report

Hi,
There is no result for the following 3 metrics:
...
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
...

CMD:
esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0

Could you help, thanks!

Hi @loganw,

I bet you have a 100% error rate during indexing and that's why you see no throughput numbers. Can you please upgrade to Rally 0.5.1 (I just released it), which now shows error rates? For example, here you can see a zero error rate:

|   All |                  Min Throughput |                   index-append |     661.248 | docs/s |
|   All |               Median Throughput |                   index-append |     661.248 | docs/s |
|   All |                  Max Throughput |                   index-append |     661.248 | docs/s |
|   All |                      error rate |                   index-append |           0 |      % |

If you see any number > 0%, you ran into errors during indexing and need to inspect the log files of Elasticsearch (and maybe also Rally) to see what happened.
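For reference, assuming you installed Rally via pip3 (the usual installation method), the upgrade is typically a one-liner:

```bash
# Assumes Rally was installed with pip3; adjust if you installed it differently.
pip3 install --upgrade esrally
```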

Daniel

Hi Daniel,

I tried with 0.5.1, here is the report:
The error rate is 0, and some metrics show negative numbers. Is that normal?

| Lap | Metric | Operation | Value | Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | -1.24765 | min |
| All | Merge time | | -6.60647 | min |
| All | Refresh time | | -0.865483 | min |
| All | Flush time | | -0.1632 | min |
| All | Merge throttle time | | -0.582767 | min |
| All | Total Young Gen GC | | 6.092 | s |
| All | Total Old Gen GC | | 0.117 | s |
| All | Heap used for segments | | 717.885 | MB |
| All | Heap used for doc values | | 5.59931 | MB |
| All | Heap used for terms | | 522.604 | MB |
| All | Heap used for norms | | 0.268616 | MB |
| All | Heap used for points | | 25.2797 | MB |
| All | Heap used for stored fields | | 164.133 | MB |
| All | Segment count | | 3100 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.0790135 | ops/s |
| All | Median Throughput | force-merge | 0.0790135 | ops/s |
| All | Max Throughput | force-merge | 0.0790135 | ops/s |
| All | 100th percentile latency | force-merge | 12656 | ms |
| All | 100th percentile service time | force-merge | 12656 | ms |
| All | error rate | force-merge | 0 | % |

Hi @loganw,

No, this is completely unexpected, and basically all of these metrics are weird:

  • Negative times are completely unexpected
  • Several metrics are missing (e.g. CPU usage, index size, ...) - Did you run instead with --pipeline=benchmark-only?
  • There is no way the heap usage metrics are that high during this benchmark (compare 160MB for stored fields reported vs. 0.6 MB below).
  • Segment count went through the roof (3100).

We have been running these benchmarks every day for more than a year and I've never seen metrics like this (so I am very interested in what happened here).

I just ran esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0 (Rally 0.5.1) on my machine and here are my results:

|   Lap |                         Metric |    Operation |     Value |   Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
|   All |                  Indexing time |              |   30.5668 |    min |
|   All |                     Merge time |              |   5.55988 |    min |
|   All |                   Refresh time |              |  0.652883 |    min |
|   All |            Merge throttle time |              |  0.917367 |    min |
|   All |               Median CPU usage |              |     674.4 |      % |
|   All |             Total Young Gen GC |              |    14.687 |      s |
|   All |               Total Old Gen GC |              |     2.509 |      s |
|   All |                     Index size |              |   2.59237 |     GB |
|   All |                Totally written |              |   10.7294 |     GB |
|   All |         Heap used for segments |              |     14.55 |     MB |
|   All |       Heap used for doc values |              |  0.128994 |     MB |
|   All |            Heap used for terms |              |   13.4386 |     MB |
|   All |            Heap used for norms |              | 0.0874634 |     MB |
|   All |           Heap used for points |              |   0.24948 |     MB |
|   All |    Heap used for stored fields |              |   0.64547 |     MB |
|   All |                  Segment count |              |       113 |        |
|   All |                 Min Throughput | index-append |   35962.8 | docs/s |
|   All |              Median Throughput | index-append |   36598.4 | docs/s |
|   All |                 Max Throughput | index-append |     36871 | docs/s |
|   All |      50.0th percentile latency | index-append |   936.697 |     ms |
|   All |      90.0th percentile latency | index-append |   1553.48 |     ms |
|   All |      99.0th percentile latency | index-append |   2537.81 |     ms |
|   All |       100th percentile latency | index-append |   3073.83 |     ms |
|   All | 50.0th percentile service time | index-append |   936.697 |     ms |
|   All | 90.0th percentile service time | index-append |   1553.48 |     ms |
|   All | 99.0th percentile service time | index-append |   2537.81 |     ms |
|   All |  100th percentile service time | index-append |   3073.83 |     ms |
|   All |                     error rate | index-append |         0 |      % |
|   All |                 Min Throughput |  force-merge |  0.378016 |  ops/s |
|   All |              Median Throughput |  force-merge |  0.378016 |  ops/s |
|   All |                 Max Throughput |  force-merge |  0.378016 |  ops/s |
|   All |       100th percentile latency |  force-merge |   2645.37 |     ms |
|   All |  100th percentile service time |  force-merge |   2645.37 |     ms |
|   All |                     error rate |  force-merge |         0 |      % |

Can you please share a few things?

  • The exact command you ran (I assume this is still esrally --challenge=append-fast-no-conflicts --pipeline=from-distribution --offline --distribution-version=5.2.0 but just to be sure)
  • The output of esrally list facts --target-hosts=127.0.0.1.

Can you please also try to run on another machine?

Daniel

Hi Daniel,

I did a rerun and the issue is not reproducible.
But there is still no throughput result, even though the error rate is 0.

| Lap | Metric | Operation | Value | Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
| All | Indexing time | | 35.9091 | min |
| All | Merge time | | 9.24398 | min |
| All | Refresh time | | 0.125467 | min |
| All | Merge throttle time | | 1.23803 | min |
| All | Median CPU usage | | 2504.5 | % |
| All | Total Young Gen GC | | 13.846 | s |
| All | Total Old Gen GC | | 2.08 | s |
| All | Index size | | 3.46825 | GB |
| All | Totally written | | 19.0826 | GB |
| All | Heap used for segments | | 22.2556 | MB |
| All | Heap used for doc values | | 0.146637 | MB |
| All | Heap used for terms | | 21.1231 | MB |
| All | Heap used for norms | | 0.103088 | MB |
| All | Heap used for points | | 0.24836 | MB |
| All | Heap used for stored fields | | 0.634415 | MB |
| All | Segment count | | 126 | |
| All | Min Throughput | index-append | | docs/s |
| All | Median Throughput | index-append | | docs/s |
| All | Max Throughput | index-append | | docs/s |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | force-merge | 0.070132 | ops/s |
| All | Median Throughput | force-merge | 0.070132 | ops/s |
| All | Max Throughput | force-merge | 0.070132 | ops/s |
| All | 100th percentile latency | force-merge | 14258.8 | ms |
| All | 100th percentile service time | force-merge | 14258.8 | ms |
| All | error rate | force-merge | 0 | % |

Here is the output of esrally list facts --target-hosts=127.0.0.1
{
  "environment": {
    "type": "TODO: Provide one of 'cloud', 'bare-metal' + details about the environment (EC2, instance type)"
  },
  "software": {
    "os_name": "Linux",
    "rally_version": "0.5.1",
    "jvm_version": "1.8.0_65-b17",
    "distribution_version": "TODO: Provide Elasticsearch distribution version",
    "jvm_vendor": "Oracle Corporation",
    "os_version": "3.10.0-327.el7.x86_64"
  },
  "hardware": {
    "disk": [
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-root",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdc",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sdb",
        "file-system": "ext4"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/mapper/centos-home",
        "file-system": "xfs"
      },
      {
        "type": "TODO: Provide one of 'ssd', 'spinning'",
        "device": "/dev/sda1",
        "file-system": "xfs"
      }
    ],
    "memory": "125gb",
    "cpu_model": "Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz"
  }
}


[INFO] SUCCESS (took 0 seconds)

Hi @loganw,

ok, these metrics are much closer to what I'd expect. Based on the kernel version I guess you are on CentOS 7?

Can you please send me the log files from Rally and Elasticsearch?

Rally's log files are located in ~/.rally/logs (just send me everything in there) and the log files of Elasticsearch are in ~/.rally/benchmarks/races/. You need to go to that directory and choose the most recent timestamp. If you ran it just now, it is in a folder with a name that starts with 2017-03-02.
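If it helps, something along these lines gathers everything (a rough sketch; the race directory name depends on when you ran the benchmark):

```bash
# Bundle Rally's own logs
tar czf rally-logs.tar.gz -C ~/.rally logs

# Show the most recent race directory (it contains the Elasticsearch logs)
ls -dt ~/.rally/benchmarks/races/*/ | head -n 1
```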

I will send you a PM with my email address.

Daniel

Hi @loganw,

thanks for the logs. Some observations:

  1. You have a hell of a cluster! :slight_smile:
  2. You ran with the benchmark-only pipeline, which explains a lot of the weird results we're seeing (e.g. no CPU and I/O data, high heap usage)

I also know now why you don't see any throughput numbers despite a zero error rate. Rally spends the first two minutes of the benchmark in a warmup phase and only then switches to normal measurement mode, but the "problem" is that your cluster is so fast that it finishes the benchmark before the two minutes are up. Hence, there are no measurement samples and you don't see any throughput numbers.

You can do now two things:

  1. Reduce the warmup time period, but that requires that you change the track itself (see the sketch after this list). Besides, a two-minute benchmark is worthless anyway. On our infrastructure it takes longer than that, but I may consider adding more data to the geonames track.
  2. Run a different benchmark. I don't know what you're after, but you should benchmark with a data set that matches your use case. We have a lot more (see https://elasticsearch-benchmarks.elastic.co/). You could, for example, run with the PMC track if you're interested in the performance of full text search (just specify --track=pmc).
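To illustrate option 1: this is roughly where the knob lives in the track's challenge definition. Treat it only as a sketch; the exact schedule structure and the client count are placeholders and depend on the track version you have:

```json
{
  "name": "append-fast-no-conflicts",
  "schedule": [
    {
      "operation": "index-append",
      "warmup-time-period": 120,
      "clients": 8
    }
  ]
}
```

Lowering warmup-time-period would give you measurement samples again, but keep in mind that it also shortens the stabilization phase.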

My takeaway is that Rally should print a warning or something similar in this case so you don't have to guess. I have opened https://github.com/elastic/rally/issues/246.

I also recommend that you bring the cluster to a defined state if you have to use the benchmark-only pipeline. Ideally, you do the following before each race:

  1. Delete all indices
  2. Run fstrim if you are on SSDs (see https://www.elastic.co/blog/is-your-elasticsearch-trimmed for the reasons)
  3. Restart the cluster

Depending on your environment there may be more things that you should do but these are the most important ones to get better repeatable results.
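Roughly, such a reset could look like this on the benchmark target (a sketch only; the host, index pattern, and restart mechanism all depend on your setup):

```bash
# 1. Delete all indices on the target cluster (assumes it listens on 127.0.0.1:9200)
curl -XDELETE 'http://127.0.0.1:9200/_all'

# 2. Trim the SSDs (run this on every data node)
sudo fstrim --all

# 3. Restart Elasticsearch (here assuming a systemd-managed installation)
sudo systemctl restart elasticsearch
```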

Thank you for providing the logs, otherwise this would have taken me much longer.

Daniel


Hi @danielmitterdorfer

Thanks a lot for your investigation. I'm trying your suggestions.

Could you tell me what Rally does during the warmup time?
Can I simply edit track.json and change the warmup-time-period to a smaller number?
I tried 100 and it works, but at what cost?

Thanks,

Hi @loganw,

every system needs some time after startup to stabilize. Otherwise you are measuring all the noise that occurs during startup. So the "cost" is that you will get numbers from a system that is not yet stable, and you should not trust them.

Here is a visualization of the throughput (from another benchmark):

[chart: throughput over time; warmup samples in blue, measurement samples in red]

Just to be clear: Rally takes samples starting from the first second of the benchmark. However, it labels them differently: the blue ones above are warmup samples and the red ones are the actual measurement samples. Only the red ones show up in the summary report on the command line. You can also see how unstable the throughput numbers are at the beginning, and this is the reason why Rally has a warmup phase.

Daniel
