Interpreting results

I've read "Are there docs to help interpret the results?" and "ESRally Metrics Definition" but still have some questions about interpreting results, specifically those from a tournament. In the results below, the index-append throughput is ~10% better for the contender, but the indexing/merge/refresh/flush/throttle times are dramatically better.

The baseline is run on 2 r4.xlarges with provisioned IOPS (io1) EBS, while the contender runs on 2 i3.xlarges using the local NVMe SSDs. Rally is invoked from an r4.xlarge.

I'd expect the index-append throughput to be much more favorable on the SSDs, just as all the other metrics are. What am I missing?

|                         Metric |                          Task |    Baseline |   Contender |     Diff |   Unit |
|-------------------------------:|------------------------------:|------------:|------------:|---------:|-------:|
|                  Indexing time |                               |     160.712 |     69.6534 | -91.0585 |    min |
|                     Merge time |                               |     1973.32 |     82.5694 | -1890.75 |    min |
|                   Refresh time |                               |     211.762 |     12.6196 | -199.143 |    min |
|                     Flush time |                               |     2.50997 |    0.824467 |  -1.6855 |    min |
|            Merge throttle time |                               |     214.459 |     55.5885 | -158.871 |    min |
|                 Min Throughput |                  index-append |     531.094 |     582.043 |  50.9495 | docs/s |
|              Median Throughput |                  index-append |     585.235 |     640.781 |  55.5465 | docs/s |
|                 Max Throughput |                  index-append |      616.81 |      685.16 |  68.3507 | docs/s |
|        50th percentile latency |                  index-append |     5319.24 |      4880.9 | -438.338 |     ms |
|        90th percentile latency |                  index-append |     7861.49 |     6394.39 |  -1467.1 |     ms |
|        99th percentile latency |                  index-append |     10389.6 |     8166.32 | -2223.31 |     ms |
|       100th percentile latency |                  index-append |     11361.2 |     9180.74 | -2180.46 |     ms |
|   50th percentile service time |                  index-append |     5319.24 |      4880.9 | -438.338 |     ms |
|   90th percentile service time |                  index-append |     7861.49 |     6394.39 |  -1467.1 |     ms |
|   99th percentile service time |                  index-append |     10389.6 |     8166.32 | -2223.31 |     ms |
|  100th percentile service time |                  index-append |     11361.2 |     9180.74 | -2180.46 |     ms |
|                     error rate |                  index-append |           0 |           0 |        0 |      % |

The index-time-related metrics are not measured in wall clock time, which can be quite confusing. We are aware of this problem and will improve it.

With respect to the smaller-than-expected difference in indexing throughput: at first glance, it looks as if the disk is not your bottleneck here. Some ideas:

  • Your documents seem to be relatively large. You should check that you do not saturate the network card. For example, we've experienced this with the pmc track (which also has relatively large documents) on three-node clusters with 1 GBit cards. You can check this e.g. with ifstat. The instance types you've mentioned allow "up to 10 GBit" of network bandwidth. 10 GBit should definitely be fine, but you could saturate a 1 GBit link. Also, it is hard to tell what "up to 10 GBit" really means for the bandwidth actually available to you during the benchmark.
  • Maybe the bulk size is not optimal, so you should vary it. Similarly, vary the number of clients.
  • Try setting the translog to async for the benchmark (see the translog sketch below). Before you enable this in production, be aware of the trade-offs (see the ES docs).
  • I don't see any GC times (did you remove them from your post?), but ensure that you have allocated enough heap for Elasticsearch. Indexing usually does not require a whole lot of heap, but the default is likely too little to achieve peak throughput.
  • It is crucial that the document corpus that Rally indexes resides on an SSD. Each Rally client reads a slice of the data file; at the disk level this is a random access pattern, and spinning disks are bad at it. Otherwise you might introduce an accidental client-side bottleneck.
  • You can check the utilization of system resources with tools like vmstat, top, and iostat (see the resource-monitoring sketch below).
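For the translog point, here is a minimal sketch of how you could flip the setting on the benchmark index via the standard `_settings` REST endpoint. The URL, the index name ("pmc") and the use of the Python `requests` library are assumptions; adjust them to your environment, and remember that the default durability of `request` is what you want for production.

```python
# Minimal sketch: switch the translog to async fsync for the benchmark index.
# Assumptions: the cluster listens on localhost:9200 without auth/TLS and the
# track writes into an index called "pmc" -- adjust both to your setup.
import requests

ES_URL = "http://localhost:9200"
INDEX = "pmc"

resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"translog": {"durability": "async"}}},
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```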
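And if you prefer a scripted resource check over ifstat/vmstat/iostat, a small sketch along these lines prints per-second network, disk, and CPU utilization on a node while the benchmark is running. It assumes the third-party psutil package, which is not part of Rally and needs to be installed separately.

```python
# Minimal sketch: print per-second network, disk, and CPU utilization so you can
# see whether the network card or the disk is saturated during the benchmark.
# Assumes psutil is installed (pip install psutil).
import time
import psutil

psutil.cpu_percent()  # prime the CPU counter; the first reading would otherwise be meaningless
prev_net = psutil.net_io_counters()
prev_disk = psutil.disk_io_counters()

for _ in range(60):  # sample for one minute; adjust as needed
    time.sleep(1)
    net = psutil.net_io_counters()
    disk = psutil.disk_io_counters()
    print(
        f"net out {(net.bytes_sent - prev_net.bytes_sent) / 1e6:8.1f} MB/s  "
        f"net in {(net.bytes_recv - prev_net.bytes_recv) / 1e6:8.1f} MB/s  "
        f"disk read {(disk.read_bytes - prev_disk.read_bytes) / 1e6:8.1f} MB/s  "
        f"disk write {(disk.write_bytes - prev_disk.write_bytes) / 1e6:8.1f} MB/s  "
        f"cpu {psutil.cpu_percent():5.1f} %"
    )
    prev_net, prev_disk = net, disk
```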

Thank you for this info @danielmitterdorfer. I was running the pmc dataset, and I did remove the GC times to make my post smaller. Just reading your post I realize I did several things wrong in how I set up and ran Rally. I'll set it up properly and give it another go.
