Run Rally races against an ES cluster built on OpenShift

Here is how I benchmark my existing ES cluster:

  1. I have an existing ES cluster running on OpenShift.
  2. The cluster has 5 nodes, each with exactly the same resources.
  3. The ES cluster is exposed through a single endpoint, so from outside it behaves like a single instance; OpenShift handles the load balancing.
  4. I created a custom track with my own index data.
  5. I ran the race with a docker command like: docker run --rm -v ${pwd}/esrally/.rally:/rally/.rally -v ${pwd}/esrally/reports:/rally/reports elastic/rally:2.10.0 race --pipeline=benchmark-only --target-host=remote.host.com:80 --track=my-custom-track --challenge=default --report-file=/rally/reports/my-custom-track-rally-report.md --report-format=markdown --on-error=abort --offline
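
To make repeated runs easier to reproduce, the command from step 5 could be assembled by a small script. This is only a rough sketch: the function name `build_race_command` and the per-run suffix on the report file are my own illustrative choices, and I am assuming `${pwd}` stands for the current working directory. All Rally flags are copied unchanged from the command above. The script only prints the commands; it does not start docker.

```python
import shlex
from pathlib import Path


def build_race_command(run_id: int, workdir: Path) -> list[str]:
    """Assemble the docker invocation for one race as an argument list.

    The flags mirror the command from step 5. The only change is a
    per-run suffix on the report file (an assumption made here so that
    later runs do not overwrite earlier reports).
    """
    report = f"/rally/reports/my-custom-track-rally-report-{run_id}.md"
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}/esrally/.rally:/rally/.rally",
        "-v", f"{workdir}/esrally/reports:/rally/reports",
        "elastic/rally:2.10.0",
        "race",
        "--pipeline=benchmark-only",
        "--target-host=remote.host.com:80",
        "--track=my-custom-track",
        "--challenge=default",
        f"--report-file={report}",
        "--report-format=markdown",
        "--on-error=abort",
        "--offline",
    ]


# Print the command line for three consecutive runs.
for run_id in (1, 2, 3):
    print(shlex.join(build_race_command(run_id, Path.cwd())))
```

Each printed line can be pasted into a shell, or the argument list could be passed to `subprocess.run` instead of being printed.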

However, when I ran a few rounds of races (against the same version of ES) and compared the results, I found that the differences can be very large. For example:

|                                                        Metric |                 Task |         Baseline |     Contender |          Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|---------------------:|-----------------:|--------------:|--------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                      |     63.0863      |     28.0918   |     -34.9945  |    min |  -55.47% |
|             Min cumulative indexing time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|          Median cumulative indexing time across primary shard |                      |      0.000741667 |      0        |      -0.00074 |    min | -100.00% |
|             Max cumulative indexing time across primary shard |                      |      3.20992     |      3.29123  |       0.08132 |    min |   +2.53% |
|           Cumulative indexing throttle time of primary shards |                      |      0           |      0        |       0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |                      |    123.878       |      1.08262  |    -122.796   |    min |  -99.13% |
|                      Cumulative merge count of primary shards |                      |  24024           |     22        |  -24002       |        |  -99.91% |
|                Min cumulative merge time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|             Median cumulative merge time across primary shard |                      |      0.00045     |      0        |      -0.00045 |    min | -100.00% |
|                Max cumulative merge time across primary shard |                      |     17.5161      |      0.566267 |     -16.9498  |    min |  -96.77% |
|              Cumulative merge throttle time of primary shards |                      |      0.259333    |      0.1108   |      -0.14853 |    min |  -57.28% |
|       Min cumulative merge throttle time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|    Median cumulative merge throttle time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|       Max cumulative merge throttle time across primary shard |                      |      0.134967    |      0.110767 |      -0.0242  |    min |  -17.93% |
|                     Cumulative refresh time of primary shards |                      |     48.9657      |      3.7339   |     -45.2318  |    min |  -92.37% |
|                    Cumulative refresh count of primary shards |                      | 224377           |    984        | -223393       |        |  -99.56% |
|              Min cumulative refresh time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|           Median cumulative refresh time across primary shard |                      |      0.003475    |      0        |      -0.00347 |    min | -100.00% |
|              Max cumulative refresh time across primary shard |                      |      5.46093     |      0.523717 |      -4.93722 |    min |  -90.41% |
|                       Cumulative flush time of primary shards |                      |      6.23043     |      2.11223  |      -4.1182  |    min |  -66.10% |
|                      Cumulative flush count of primary shards |                      |  14090           |    132        |  -13958       |        |  -99.06% |
|                Min cumulative flush time across primary shard |                      |      0           |      0        |       0       |    min |    0.00% |
|             Median cumulative flush time across primary shard |                      |      0.00205     |      0        |      -0.00205 |    min | -100.00% |
|                Max cumulative flush time across primary shard |                      |      0.50635     |      0.337367 |      -0.16898 |    min |  -33.37% |
|                                       Total Young Gen GC time |                      |     69.005       |     71.058    |       2.053   |      s |   +2.98% |
|                                      Total Young Gen GC count |                      |   5992           |   5938        |     -54       |        |   -0.90% |
|                                         Total Old Gen GC time |                      |      1.064       |      0.816    |      -0.248   |      s |  -23.31% |
|                                        Total Old Gen GC count |                      |     15           |     10        |      -5       |        |  -33.33% |
|                                                    Store size |                      |     11.527       |     11.6336   |       0.10655 |     GB |   +0.92% |
|                                                 Translog size |                      |      0.6923      |      0.11595  |      -0.57635 |     GB |  -83.25% |
|                                        Heap used for segments |                      |      3.9797      |      3.93391  |      -0.04579 |     MB |   -1.15% |
|                                      Heap used for doc values |                      |      0.778618    |      0.690968 |      -0.08765 |     MB |  -11.26% |
|                                           Heap used for terms |                      |      2.65974     |      2.70258  |       0.04285 |     MB |   +1.61% |
|                                           Heap used for norms |                      |      0.338989    |      0.346619 |       0.00763 |     MB |   +2.25% |
|                                          Heap used for points |                      |      0           |      0        |       0       |     MB |    0.00% |
|                                   Heap used for stored fields |                      |      0.202354    |      0.193741 |      -0.00861 |     MB |   -4.26% |
|                                                 Segment count |                      |    421           |    402        |     -19       |        |   -4.51% |
|                                   Total Ingest Pipeline count |                      |      0           |      0        |       0       |        |    0.00% |
|                                    Total Ingest Pipeline time |                      |      0           |      0        |       0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |                      |      0           |      0        |       0       |        |    0.00% |
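
For reference, the Diff % column appears to be the relative change from Baseline to Contender. Below is a minimal sketch that recomputes it for a few rows of the table, assuming the formula `(contender - baseline) / baseline * 100`. That formula is inferred from the numbers above, not taken from Rally's source, and the function name `diff_percent` is only illustrative.

```python
def diff_percent(baseline: float, contender: float) -> float:
    """Relative change from baseline to contender, in percent."""
    if baseline == 0:
        # Assumption: rows where both values are 0 are shown as 0.00%
        # in the table; a non-zero contender over a zero baseline has
        # no defined percentage, so it is rejected here.
        if contender == 0:
            return 0.0
        raise ValueError("percentage change is undefined for a zero baseline")
    return (contender - baseline) / baseline * 100.0


# (baseline, contender) pairs copied from the comparison table.
rows = {
    "Cumulative indexing time of primary shards": (63.0863, 28.0918),
    "Cumulative merge time of primary shards": (123.878, 1.08262),
    "Total Young Gen GC time": (69.005, 71.058),
}

for name, (baseline, contender) in rows.items():
    print(f"{name}: {diff_percent(baseline, contender):+.2f}%")
```

The cumulative indexing, merge, refresh and flush metrics are where the largest swings show up, while GC time, store size and segment count differ by only a few percent.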

Questions:

  1. Why do some of the results differ so much between races, even though every race was run against exactly the same version of ES, with the same data set and the same track?
  2. How can I make sure the report reflects the real performance of the ES cluster?
  3. What are the recommended steps for benchmarking a remote ES cluster built on OpenShift?

Thanks.
