Rally Track Report Analysis


#1

Hello,

we want to get a measurement on resources that still might be available in our elasticsearch cluster.

To get an idea we run Rally 0.9.:
esrally --track="http_logs" --target-hosts="node1, node2" --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'superuser',basic_auth_password:'...'"

We now have difficulties to interpret the report:

Lap Metric Task Value Unit
All Indexing time 0 min
All Indexing throttle time 0 min
All Merge time 0 min
All Refresh time 0 min
All Flush time 0 min
All Merge throttle time 0 min
All Total Young Gen GC 0 s
All Total Old Gen GC 0 s
All Min Throughput index-append 74604.3 docs/s
All Median Throughput index-append 76929.9 docs/s
All Max Throughput index-append 77214.6 docs/s
All 50th percentile latency index-append 445.248 ms
All 90th percentile latency index-append 693.113 ms
All 99th percentile latency index-append 1090.24 ms
All 99.9th percentile latency index-append 1497.43 ms
All 99.99th percentile latency index-append 1918.44 ms
All 100th percentile latency index-append 2313.48 ms
All 50th percentile service time index-append 445.248 ms
All 90th percentile service time index-append 693.113 ms
All 99th percentile service time index-append 1090.24 ms
All 99.9th percentile service time index-append 1497.43 ms
All 99.99th percentile service time index-append 1918.44 ms
All 100th percentile service time index-append 2313.48 ms
All error rate index-append 0 %
All I CUT term, hourly_agg, index-stats and node stats to match body limitation.
All Min Throughput default 3.33 ops/s
All Median Throughput default 3.36 ops/s
All Max Throughput default 3.38 ops/s
All 50th percentile latency default 69916.2 ms
All 50th percentile latency default 69916.2 ms
All 90th percentile latency default 108802 ms
All 99th percentile latency default 117133 ms
All 100th percentile latency default 118069 ms
All 50th percentile service time default 285.997 ms
All 90th percentile service time default 344.239 ms
All 99th percentile service time default 452.941 ms
All 100th percentile service time default 547.525 ms
All error rate default 0 %
All Min Throughput range 1.76 ops/s
All Median Throughput range 1.78 ops/s
All Max Throughput range 1.79 ops/s
All 50th percentile latency range 12738.3 ms
All 90th percentile latency range 19092.1 ms
All 99th percentile latency range 20829.9 ms
All 100th percentile latency range 21039.9 ms
All 50th percentile service time range 559.023 ms
All 90th percentile service time range 665.189 ms
All 99th percentile service time range 852.438 ms
All 100th percentile service time range 1131.42 ms
All error rate range 0 %
All Min Throughput scroll 19.45 ops/s
All Median Throughput scroll 20.25 ops/s
All Max Throughput scroll 20.84 ops/s
All 50th percentile latency scroll 228750 ms
All 90th percentile latency scroll 313780 ms
All 99th percentile latency scroll 332907 ms
All 100th percentile latency scroll 335018 ms
All 50th percentile service time scroll 1151.23 ms
All 90th percentile service time scroll 1516.23 ms
All 99th percentile service time scroll 2033.21 ms
All 100th percentile service time scroll 2359.8 ms
All error rate scroll 0 %

The questions are:

  • Why are there no values at the top of the table for 'Indexing time', 'Indexing throttle time'... How can it be fixed?
  • I can guess it but am I right to say that:
    | All | Median Throughput | index-append | **76929.9** | docs/s | means that the cluster is able to index 76929.9 documents per second? We need such information to limit new clients that send documents to index.
  • Is there any good measurement to get an idea on search performance (how many users can run simple searches like '*') while the elasticsearch nodes are continuously indexing documents?
  • What would be the main conclusion reading this report regarding search and index performance?

Could you please help to interpret the report or point to further documentation?

Thank you very much!

Regards,
ize0j10


Rally Track Report Analysis (follow up)
(Daniel Mitterdorfer) #2

There might have been a problem retrieving these values which you should see in the logs. You can check if you can retrieve the value by yourself via the indices stats API (that's what Rally uses internally). Can you please also share which version of Elasticsearch you are using?

Yes, this is correct. The cluster can index that many documents per second with the given document corpus (i.e. http_logs), the given bulk size, number of clients, cluster configuration and index settings.

You can do that but you need to create your own track to achieve that as we don't this in our standard tracks. You should especially look at the docs that explain how to use the "parallel" element in Rally. You might also be interested in the eventdata track which does that and also want to watch the Webinar Getting your cluster size right (free to watch but requires registration). Especially if you want to simulate multiple users executing operations it is crucial to think about choosing the correct schedule to model client arrivals in order to get meaningful results.

You need to watch out for two things:

  1. An error rate of zero otherwise you might look at non-representative results (e.g. it might include bulk rejections and the like). In your case that's fine.
  2. Check whether the system can achieve the target throughput which means that it can keep up with the injected load. This is not the case for a few queries (like term) where you can see that the throughput is less than what's specified in the track. Also, service time and latency are quite far apart indicating that clients need to wait for a very long time. See also the article Relating Service Utilisation to Latency and the Rally FAQ.

In general I'd suggest that you use the tracks that come out of the box only to get started and write your own track so you can measure a scenario that is close to your production workload to get a better idea of how the system behaves for your use-case and configuration.


#3

Using the API directly provided such values on the existing indices. Could it be that the API call took place before there had been such operations (for this reason zero values)?

For this benchmark we used Elasticsearch 6.1.2.


(Daniel Mitterdorfer) #4

Hi,

Rally queries all values at the start of a benchmark (which is, after the cluster has started), queries them again after the benchmark has ended and calculates the difference. In case a value cannot be retrieved, Rally will set the value to zero and issue a warning like:

Could not determine value at path [merges,total_time_in_millis]. Returning default value [0]

I just started a 6.1.2 node locally and ran esrally --pipeline=benchmark-only --challenge=append-no-conflicts-index-only which has produced:

|   Lap |                         Metric |         Task |     Value |   Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
|   All |                  Indexing time |              |   41.9884 |    min |
|   All |         Indexing throttle time |              |    0.2213 |    min |
|   All |                     Merge time |              |   20.5545 |    min |
|   All |                   Refresh time |              |   6.41378 |    min |
|   All |                     Flush time |              |    0.6925 |    min |
|   All |            Merge throttle time |              |   2.38558 |    min |

So it does not seem to be a systematic problem.

For reference, Rally issues basically the following (requires jq):

curl -s -XGET 'localhost:9200/_stats/_all?pretty&level=shards' | jq "._all.primaries"

If you could run this before you start Rally and after the benchmark has finished this might help me resolve the issue.


#5

Hi Daniel,

If you could run this before you start Rally and after the benchmark has finished this might help me resolve the issue.

Before:

{
"docs": {
"count": 1437022,
"deleted": 1702
},
"store": {
"size_in_bytes": 656372799
},
"indexing": {
"index_total": 2632778,
"index_time_in_millis": 1150048,
"index_current": 0,
"index_failed": 7,
"delete_total": 26351,
"delete_time_in_millis": 1051,
"delete_current": 0,
"noop_update_total": 8,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 105914,
"time_in_millis": 14243,
"exists_total": 105863,
"exists_time_in_millis": 14235,
"missing_total": 51,
"missing_time_in_millis": 8,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 209483,
"query_time_in_millis": 29444,
"query_current": 0,
"fetch_total": 187514,
"fetch_time_in_millis": 15158,
"fetch_current": 0,
"scroll_total": 26102,
"scroll_time_in_millis": 7460,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 34293,
"total_time_in_millis": 6486023,
"total_docs": 666452775,
"total_size_in_bytes": 183250991598,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 707,
"total_auto_throttle_in_bytes": 663462632
},
"refresh": {
"total": 334707,
"total_time_in_millis": 4129307,
"listeners": 0
},
"flush": {
"total": 24,
"total_time_in_millis": 480
},
"warmer": {
"current": 0,
"total": 308319,
"total_time_in_millis": 29123
},
"query_cache": {
"memory_size_in_bytes": 254278,
"total_count": 112798,
"hit_count": 9320,
"miss_count": 103478,
"cache_size": 34,
"cache_count": 1879,
"evictions": 1845
},
"fielddata": {
"memory_size_in_bytes": 2632,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 147,
"memory_in_bytes": 2587970,
"terms_memory_in_bytes": 1119254,
"stored_fields_memory_in_bytes": 194600,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 56512,
"points_memory_in_bytes": 240808,
"doc_values_memory_in_bytes": 976796,
"index_writer_memory_in_bytes": 21126311,
"version_map_memory_in_bytes": 6151324,
"fixed_bit_set_memory_in_bytes": 159720,
"max_unsafe_auto_id_timestamp": 1518163942643,
"file_sizes": {}
},
"translog": {
"operations": 610477,
"size_in_bytes": 840225274,
"uncommitted_operations": 421715,
"uncommitted_size_in_bytes": 544563526
},
"request_cache": {
"memory_size_in_bytes": 5136,
"evictions": 0,
"hit_count": 111219,
"miss_count": 113
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}

After:
{
"docs": {
"count": 248720257,
"deleted": 1906
},
"store": {
"size_in_bytes": 22079040416
},
"indexing": {
"index_total": 249970865,
"index_time_in_millis": 8934976,
"index_current": 0,
"index_failed": 7,
"delete_total": 26965,
"delete_time_in_millis": 1070,
"delete_current": 0,
"noop_update_total": 8,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 108378,
"time_in_millis": 14554,
"exists_total": 108327,
"exists_time_in_millis": 14546,
"missing_total": 51,
"missing_time_in_millis": 8,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 536867,
"query_time_in_millis": 4015818,
"query_current": 0,
"fetch_total": 455888,
"fetch_time_in_millis": 119689,
"fetch_current": 0,
"scroll_total": 37231,
"scroll_time_in_millis": 14369092,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 39936,
"total_time_in_millis": 12879214,
"total_docs": 1340704824,
"total_size_in_bytes": 250882666611,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 4050187,
"total_auto_throttle_in_bytes": 1270120170
},
"refresh": {
"total": 357658,
"total_time_in_millis": 5126715,
"listeners": 0
},
"flush": {
"total": 131,
"total_time_in_millis": 15469
},
"warmer": {
"current": 0,
"total": 330245,
"total_time_in_millis": 29907
},
"query_cache": {
"memory_size_in_bytes": 25793830,
"total_count": 258252,
"hit_count": 49516,
"miss_count": 208736,
"cache_size": 230,
"cache_count": 2080,
"evictions": 1850
},
"fielddata": {
"memory_size_in_bytes": 8504,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 739,
"memory_in_bytes": 95424188,
"terms_memory_in_bytes": 79959842,
"stored_fields_memory_in_bytes": 8047600,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 99584,
"points_memory_in_bytes": 6278918,
"doc_values_memory_in_bytes": 1038244,
"index_writer_memory_in_bytes": 267279,
"version_map_memory_in_bytes": 53971,
"fixed_bit_set_memory_in_bytes": 163368,
"max_unsafe_auto_id_timestamp": 1518163942643,
"file_sizes": {}
},
"translog": {
"operations": 77655904,
"size_in_bytes": 16659955310,
"uncommitted_operations": 33608,
"uncommitted_size_in_bytes": 37007332
},
"request_cache": {
"memory_size_in_bytes": 27728,
"evictions": 0,
"hit_count": 113956,
"miss_count": 173
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}

Thanks,
ize0j10


(Daniel Mitterdorfer) #6

Thanks for the data. Your output looks fine to me and should indeed produce numbers. I will improve the error handling and logging in this area so we have more information what goes wrong here. I have raised https://github.com/elastic/rally/issues/416.


(Daniel Mitterdorfer) #7

fyi, we've just released Rally 0.9.2 which adds a bit more logging in that area of the code.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.