Rally: compare multiple races / compare in kibana?

Hi all,

I want to evaluate the impact of running Elasticsearch as a Docker container instead of a classic / native installation. So I want to run races against both the Docker container and the classic installation we have (as external clusters) and compare the results, using the geonames, http_logs and eventdata tracks.

Also I want to run each race three times and average the values, because in the first two iterations I saw differences of about 3-4% in indexing time and 50th percentiles on the same test set, and on large_filtered_terms I saw differences of about 12% between two runs.

I am a bit curious why I see such big differences. Nevertheless, I wanted to work around this by averaging the results and comparing the Docker average against the native-installation average.

Before accidentally reinventing the wheel I just wanted to ask:

  1. Is there any out of the box way to send the results to a separate elasticsearch instance to access the result data via kibana?

  2. Are there any pre-built dashboards?

  3. Is there already a function to compare a baseline to more than one contender? Not racing one-on-one, just with a full race grid :wink:

  4. Is there already a function to calculate the average result of multiple races?

  5. What experience do you have about differences in the results when rerunning the test? Is 12% normal?

Thanks a lot,
Andreas

Hello Andreas,

I've added replies to your questions below:

  1. Is there any out of the box way to send the results to a separate elasticsearch instance to access the result data via kibana?

Yes, you can use the metrics store feature for that. Please take a look at the documentation for which metrics we collect (https://esrally.readthedocs.io/en/stable/metrics.html) and how to configure Rally to send them to a dedicated Elasticsearch cluster, from which you can access them in Kibana: https://esrally.readthedocs.io/en/stable/configuration.html
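As a sketch, the relevant section of `~/.rally/rally.ini` would look something like the following. The option names are taken from the configuration docs linked above; host, user and password are placeholders you need to replace:

```ini
[reporting]
datastore.type = elasticsearch
datastore.host = metrics-cluster.example.org
datastore.port = 9200
datastore.secure = True
datastore.user = rally_metrics
datastore.password = changeme
```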

  2. Are there any pre-built dashboards?

There are no pre-built dashboards. This is on our roadmap, but it's not a high-priority item.
We generally don't provide dashboards because there are many different ways you may want to use the data. To create your own dashboard, query the `rally-results-*` indices; take a look at the metrics documentation for what we collect. Another tip is to use user tags (`--user-tag`) to simplify filtering.
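For example (a sketch only, not an official API): if you start races with `--user-tag="setup:docker"`, a query body along these lines should let you filter the results. The field name `user-tags.setup` is an assumption on my side; please verify it against your own documents:

```python
# Hypothetical sketch: build an Elasticsearch query body that selects
# documents from rally-results-* for races started with
# --user-tag="setup:docker". The field name "user-tags.setup" is an
# assumption -- verify it against your own stored documents.
def results_for_tag(tag_key, tag_value):
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {f"user-tags.{tag_key}": tag_value}}
                ]
            }
        }
    }

query = results_for_tag("setup", "docker")
```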

  3. Is there already a function to compare a baseline to more than one contender? Not racing one-on-one, just with a full race grid

The tournament functionality is only meant for simple cases. For more complex cases it is best to analyze the data yourself in Kibana.

  4. Is there already a function to calculate the average result of multiple races?

No. By design, Rally itself does not provide extensive analysis capabilities; Kibana (or other tools) can be used for more complex analysis.
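That said, once you pull the per-race results out of `rally-results-*`, averaging them yourself is straightforward. A minimal Python sketch; the documents and field names here are illustrative only, not the exact shape Rally stores:

```python
from collections import defaultdict
from statistics import mean

# Illustrative documents as they might be fetched from rally-results-*;
# the field names ("environment", "name", "value") are assumptions.
docs = [
    {"environment": "docker", "name": "service_time_p50", "value": 101.0},
    {"environment": "docker", "name": "service_time_p50", "value": 99.0},
    {"environment": "docker", "name": "service_time_p50", "value": 103.0},
    {"environment": "native", "name": "service_time_p50", "value": 96.0},
    {"environment": "native", "name": "service_time_p50", "value": 98.0},
    {"environment": "native", "name": "service_time_p50", "value": 100.0},
]

def average_by_environment(docs, metric):
    """Group the chosen metric by environment and average it."""
    groups = defaultdict(list)
    for d in docs:
        if d["name"] == metric:
            groups[d["environment"]].append(d["value"])
    return {env: mean(values) for env, values in groups.items()}

print(average_by_environment(docs, "service_time_p50"))
# {'docker': 101.0, 'native': 98.0}
```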

  5. What experience do you have about differences in the results when rerunning the test? Is 12% normal?

12% fluctuation sounds a bit too high to me and might require investigation on your side. Please refer to https://www.elastic.co/blog/seven-tips-for-better-elasticsearch-benchmarks for best practices. It might also be helpful to take a look at our nightly dashboards to see what fluctuation we normally observe in our nightly runs on bare-metal hardware.

Thank you,
Evgenia

Thanks a lot for your reply. I configured the tested cluster to push its metrics to a separate monitoring cluster. I also sent Metricbeat data to that monitoring cluster to have a complete overview of the system during the test runs.

  1. Is the source code / configuration for your nightly dashboards available somewhere to download?

  2. I have some issues understanding the values / fields which are stored in the rally-metrics-* index. What is the meaning of the following fields?

  • value (is it a cumulative value which rises with each probe?)
  • meta.took
  • meta.hits

I did not find a description of them in the esrally documentation. So if I want to visualize the latency or service_time of painless_dynamic as a histogram over time, which fields do I need?
I would filter the rally-metrics-* index for "name: latency AND sample-type: normal", but which field contains my data?

  3. When I take the field value as the data source, the histogram looks like this: [screenshot: steadily increasing values]
     So it looks like a cumulative value, but if I just check the same for index-append, it looks like a current value: [screenshot: values fluctuating around a constant level]
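For reference, this is the filter I have in mind, expressed as an Elasticsearch query body. The field names (`name`, `sample-type`, `task`, `value`) are taken from the metrics documentation; whether `value` really holds the data point is exactly what I am unsure about:

```python
# Sketch of a query body against rally-metrics-* that pulls the raw
# latency samples for one operation, sorted by time. Field names are
# assumptions based on the Rally metrics documentation.
def latency_samples(task):
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": "latency"}},
                    {"term": {"sample-type": "normal"}},
                    {"term": {"task": task}}
                ]
            }
        },
        "sort": [{"@timestamp": "asc"}],
        "_source": ["@timestamp", "value"]
    }

body = latency_samples("painless_dynamic")
```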

I dug a bit deeper and compared service_time and latency.

The value for latency shows the increasing behavior from above. The value for service_time also looks like a current value, as expected:

But when I check the throughput, it only decreases very slightly, so I cannot explain the latency behavior.

By the way, I tested with the geonames track.

Hello Andreas,

Is the source code / configuration for your nightly dashboards available somewhere to download?

At this point it is not available. We do have a way to generate charts in Rally: https://github.com/elastic/rally/blob/master/esrally/chart_generator.py. Please note that at this point this is considered experimental and is thus intentionally undocumented. There is a mode in the chart generator that lets you generate charts for a single combination of a track, challenge, car and node count. You could try something like `esrally generate charts --track=geonames --challenge=append-no-conflicts --chart-type=time-series --node-count=1 --car=4gheap --output-path=output-my-charts.json`

value (is it a cumulative value which rises with each probe?)

It is not cumulative.

meta.took

This is the value Elasticsearch itself returns for how long the query took to complete.

meta.hits

This is also what Elasticsearch returns: the number of hits for the query.

So it looks like a cumulative value, but if I just check the same for index-append, it looks like a current value:

Rally reports service time as the time from when it sends the request to Elasticsearch until it receives the reply. Latency is service time plus any extra waiting time of the request. For more information please take a look at the FAQ. From your graph it looks like there might be some contention in your setup.
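To illustrate the difference with a toy model (this is not Rally's actual implementation): when requests are scheduled faster than the system can serve them, service time stays flat while latency grows with every request, because each request waits longer and longer for its turn:

```python
# Toy model: requests are fired on a fixed schedule (target throughput
# 1.5 ops/s => one request every ~0.67 s) against a server whose service
# time is a constant 1.0 s. All numbers are illustrative.
def simulate(n, target_throughput, service_time):
    interval = 1.0 / target_throughput
    latencies = []
    finish = 0.0
    for i in range(n):
        scheduled = i * interval           # when the request *should* start
        start = max(scheduled, finish)     # but it may have to wait in line
        finish = start + service_time
        latencies.append(finish - scheduled)  # latency = waiting + service time
    return latencies

lat = simulate(10, target_throughput=1.5, service_time=1.0)
# service time stays at 1.0 s, but latency grows by ~0.33 s per request
```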

Thank you,
Evgenia

Hi Evgenia, thanks for your reply.

So do I understand it correctly?
esrally queries data at a fixed rate, the Elasticsearch backend responds more slowly than esrally is querying, and so the queue in front of query processing grows over time.

Here again is our picture from more recent tests:

The operation painless_static is configured this way in Rally's append-no-conflicts challenge:

      {
        "operation": "painless_static",
        "clients": 1,
        "warmup-iterations": 200,
        "iterations": 100,
        "target-throughput": 1.5
      },

So the target throughput is 1.5 and our system is only capable of about 0.75 to 1.0.
That would explain the rising latency.

  1. Did I understand this correctly?
  2. What if my system were capable of a throughput of 5.0? Would Rally then report max throughput = 1.5 because it was not requesting more (i.e. just showing the real measurement), or would I see something around 5.0 because Rally would calculate it based on latency and busy and idle time?
  3. Is it documented anywhere which resources mainly affect the throughput? Load and CPU are not at their limits while the latency is increasing, and at first glance I also see no issues with disk or network.

Thanks, Andreas

Yes.

If you set the target throughput to 1.5 operations per second, Rally will try to achieve that but will not go beyond it. For a realistic benchmark you should choose a target throughput that you also see in your production system. Say you have an e-commerce website with 100 concurrent users, each hitting your search page roughly once every 10 seconds; then you should see ~10 queries being issued per second. If you want to benchmark that scenario you should set a target throughput of 10 (operations per second).
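The arithmetic behind that example, spelled out:

```python
# Back-of-the-envelope target throughput: N concurrent users, each
# issuing one request roughly every S seconds => N / S requests per second.
def target_throughput(concurrent_users, seconds_between_requests):
    return concurrent_users / seconds_between_requests

print(target_throughput(100, 10))  # 10.0 operations per second
```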

No, and I don't think it would be helpful: every system is different. If you have a slow disk, the disk might be the bottleneck; if you have a fast disk but only one CPU core, it might be the CPU. Therefore, you should measure this. The USE method is a good starting point.

Daniel

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.