Rally Track Report Analysis

ize0j10 · January 30, 2018, 9:32am

Hello,

we want to get a measurement on resources that still might be available in our elasticsearch cluster.

To get an idea we run Rally 0.9.:
esrally --track="http_logs" --target-hosts="node1, node2" --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'superuser',basic_auth_password:'...'"

We now have difficulties to interpret the report:

Lap	Metric	Task	Value	Unit
All	Indexing time		0	min
All	Indexing throttle time		0	min
All	Merge time		0	min
All	Refresh time		0	min
All	Flush time		0	min
All	Merge throttle time		0	min
All	Total Young Gen GC		0	s
All	Total Old Gen GC		0	s
All	Min Throughput	index-append	74604.3	docs/s
All	Median Throughput	index-append	76929.9	docs/s
All	Max Throughput	index-append	77214.6	docs/s
All	50th percentile latency	index-append	445.248	ms
All	90th percentile latency	index-append	693.113	ms
All	99th percentile latency	index-append	1090.24	ms
All	99.9th percentile latency	index-append	1497.43	ms
All	99.99th percentile latency	index-append	1918.44	ms
All	100th percentile latency	index-append	2313.48	ms
All	50th percentile service time	index-append	445.248	ms
All	90th percentile service time	index-append	693.113	ms
All	99th percentile service time	index-append	1090.24	ms
All	99.9th percentile service time	index-append	1497.43	ms
All	99.99th percentile service time	index-append	1918.44	ms
All	100th percentile service time	index-append	2313.48	ms
All	error rate	index-append	0	%
All	I CUT term, hourly_agg, index-stats and node stats to match body limitation.
All	Min Throughput	default	3.33	ops/s
All	Median Throughput	default	3.36	ops/s
All	Max Throughput	default	3.38	ops/s
All	50th percentile latency	default	69916.2	ms
All	50th percentile latency	default	69916.2	ms
All	90th percentile latency	default	108802	ms
All	99th percentile latency	default	117133	ms
All	100th percentile latency	default	118069	ms
All	50th percentile service time	default	285.997	ms
All	90th percentile service time	default	344.239	ms
All	99th percentile service time	default	452.941	ms
All	100th percentile service time	default	547.525	ms
All	error rate	default	0	%
All	Min Throughput	range	1.76	ops/s
All	Median Throughput	range	1.78	ops/s
All	Max Throughput	range	1.79	ops/s
All	50th percentile latency	range	12738.3	ms
All	90th percentile latency	range	19092.1	ms
All	99th percentile latency	range	20829.9	ms
All	100th percentile latency	range	21039.9	ms
All	50th percentile service time	range	559.023	ms
All	90th percentile service time	range	665.189	ms
All	99th percentile service time	range	852.438	ms
All	100th percentile service time	range	1131.42	ms
All	error rate	range	0	%
All	Min Throughput	scroll	19.45	ops/s
All	Median Throughput	scroll	20.25	ops/s
All	Max Throughput	scroll	20.84	ops/s
All	50th percentile latency	scroll	228750	ms
All	90th percentile latency	scroll	313780	ms
All	99th percentile latency	scroll	332907	ms
All	100th percentile latency	scroll	335018	ms
All	50th percentile service time	scroll	1151.23	ms
All	90th percentile service time	scroll	1516.23	ms
All	99th percentile service time	scroll	2033.21	ms
All	100th percentile service time	scroll	2359.8	ms
All	error rate	scroll	0	%

The questions are:

Why are there no values at the top of the table for 'Indexing time', 'Indexing throttle time'... How can it be fixed?
I can guess it but am I right to say that:
| All | Median Throughput | index-append | **76929.9** | docs/s | means that the cluster is able to index 76929.9 documents per second? We need such information to limit new clients that send documents to index.
Is there any good measurement to get an idea on search performance (how many users can run simple searches like '*') while the elasticsearch nodes are continuously indexing documents?
What would be the main conclusion reading this report regarding search and index performance?

Could you please help to interpret the report or point to further documentation?

Thank you very much!

Regards,
ize0j10

danielmitterdorfer · January 30, 2018, 10:35am

There might have been a problem retrieving these values which you should see in the logs. You can check if you can retrieve the value by yourself via the indices stats API (that's what Rally uses internally). Can you please also share which version of Elasticsearch you are using?

Yes, this is correct. The cluster can index that many documents per second with the given document corpus (i.e. http_logs), the given bulk size, number of clients, cluster configuration and index settings.

You can do that but you need to create your own track to achieve that as we don't this in our standard tracks. You should especially look at the docs that explain how to use the "parallel" element in Rally. You might also be interested in the eventdata track which does that and also want to watch the Webinar Getting your cluster size right (free to watch but requires registration). Especially if you want to simulate multiple users executing operations it is crucial to think about choosing the correct schedule to model client arrivals in order to get meaningful results.

You need to watch out for two things:

An error rate of zero otherwise you might look at non-representative results (e.g. it might include bulk rejections and the like). In your case that's fine.
Check whether the system can achieve the target throughput which means that it can keep up with the injected load. This is not the case for a few queries (like term) where you can see that the throughput is less than what's specified in the track. Also, service time and latency are quite far apart indicating that clients need to wait for a very long time. See also the article Relating Service Utilisation to Latency and the Rally FAQ.

In general I'd suggest that you use the tracks that come out of the box only to get started and write your own track so you can measure a scenario that is close to your production workload to get a better idea of how the system behaves for your use-case and configuration.

ize0j10 · January 31, 2018, 11:42am

Using the API directly provided such values on the existing indices. Could it be that the API call took place before there had been such operations (for this reason zero values)?

For this benchmark we used Elasticsearch 6.1.2.

danielmitterdorfer · February 6, 2018, 8:05am

Hi,

Rally queries all values at the start of a benchmark (which is, after the cluster has started), queries them again after the benchmark has ended and calculates the difference. In case a value cannot be retrieved, Rally will set the value to zero and issue a warning like:

Could not determine value at path [merges,total_time_in_millis]. Returning default value [0]

I just started a 6.1.2 node locally and ran esrally --pipeline=benchmark-only --challenge=append-no-conflicts-index-only which has produced:

|   Lap |                         Metric |         Task |     Value |   Unit |
|------:|-------------------------------:|-------------:|----------:|-------:|
|   All |                  Indexing time |              |   41.9884 |    min |
|   All |         Indexing throttle time |              |    0.2213 |    min |
|   All |                     Merge time |              |   20.5545 |    min |
|   All |                   Refresh time |              |   6.41378 |    min |
|   All |                     Flush time |              |    0.6925 |    min |
|   All |            Merge throttle time |              |   2.38558 |    min |

So it does not seem to be a systematic problem.

For reference, Rally issues basically the following (requires jq):

curl -s -XGET 'localhost:9200/_stats/_all?pretty&level=shards' | jq "._all.primaries"

If you could run this before you start Rally and after the benchmark has finished this might help me resolve the issue.

ize0j10 · February 12, 2018, 11:12am

Hi Daniel,

If you could run this before you start Rally and after the benchmark has finished this might help me resolve the issue.

Before:

{
"docs": {
"count": 1437022,
"deleted": 1702
},
"store": {
"size_in_bytes": 656372799
},
"indexing": {
"index_total": 2632778,
"index_time_in_millis": 1150048,
"index_current": 0,
"index_failed": 7,
"delete_total": 26351,
"delete_time_in_millis": 1051,
"delete_current": 0,
"noop_update_total": 8,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 105914,
"time_in_millis": 14243,
"exists_total": 105863,
"exists_time_in_millis": 14235,
"missing_total": 51,
"missing_time_in_millis": 8,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 209483,
"query_time_in_millis": 29444,
"query_current": 0,
"fetch_total": 187514,
"fetch_time_in_millis": 15158,
"fetch_current": 0,
"scroll_total": 26102,
"scroll_time_in_millis": 7460,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 34293,
"total_time_in_millis": 6486023,
"total_docs": 666452775,
"total_size_in_bytes": 183250991598,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 707,
"total_auto_throttle_in_bytes": 663462632
},
"refresh": {
"total": 334707,
"total_time_in_millis": 4129307,
"listeners": 0
},
"flush": {
"total": 24,
"total_time_in_millis": 480
},
"warmer": {
"current": 0,
"total": 308319,
"total_time_in_millis": 29123
},
"query_cache": {
"memory_size_in_bytes": 254278,
"total_count": 112798,
"hit_count": 9320,
"miss_count": 103478,
"cache_size": 34,
"cache_count": 1879,
"evictions": 1845
},
"fielddata": {
"memory_size_in_bytes": 2632,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 147,
"memory_in_bytes": 2587970,
"terms_memory_in_bytes": 1119254,
"stored_fields_memory_in_bytes": 194600,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 56512,
"points_memory_in_bytes": 240808,
"doc_values_memory_in_bytes": 976796,
"index_writer_memory_in_bytes": 21126311,
"version_map_memory_in_bytes": 6151324,
"fixed_bit_set_memory_in_bytes": 159720,
"max_unsafe_auto_id_timestamp": 1518163942643,
"file_sizes": {}
},
"translog": {
"operations": 610477,
"size_in_bytes": 840225274,
"uncommitted_operations": 421715,
"uncommitted_size_in_bytes": 544563526
},
"request_cache": {
"memory_size_in_bytes": 5136,
"evictions": 0,
"hit_count": 111219,
"miss_count": 113
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}

After:
{
"docs": {
"count": 248720257,
"deleted": 1906
},
"store": {
"size_in_bytes": 22079040416
},
"indexing": {
"index_total": 249970865,
"index_time_in_millis": 8934976,
"index_current": 0,
"index_failed": 7,
"delete_total": 26965,
"delete_time_in_millis": 1070,
"delete_current": 0,
"noop_update_total": 8,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 108378,
"time_in_millis": 14554,
"exists_total": 108327,
"exists_time_in_millis": 14546,
"missing_total": 51,
"missing_time_in_millis": 8,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 536867,
"query_time_in_millis": 4015818,
"query_current": 0,
"fetch_total": 455888,
"fetch_time_in_millis": 119689,
"fetch_current": 0,
"scroll_total": 37231,
"scroll_time_in_millis": 14369092,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 39936,
"total_time_in_millis": 12879214,
"total_docs": 1340704824,
"total_size_in_bytes": 250882666611,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 4050187,
"total_auto_throttle_in_bytes": 1270120170
},
"refresh": {
"total": 357658,
"total_time_in_millis": 5126715,
"listeners": 0
},
"flush": {
"total": 131,
"total_time_in_millis": 15469
},
"warmer": {
"current": 0,
"total": 330245,
"total_time_in_millis": 29907
},
"query_cache": {
"memory_size_in_bytes": 25793830,
"total_count": 258252,
"hit_count": 49516,
"miss_count": 208736,
"cache_size": 230,
"cache_count": 2080,
"evictions": 1850
},
"fielddata": {
"memory_size_in_bytes": 8504,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 739,
"memory_in_bytes": 95424188,
"terms_memory_in_bytes": 79959842,
"stored_fields_memory_in_bytes": 8047600,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 99584,
"points_memory_in_bytes": 6278918,
"doc_values_memory_in_bytes": 1038244,
"index_writer_memory_in_bytes": 267279,
"version_map_memory_in_bytes": 53971,
"fixed_bit_set_memory_in_bytes": 163368,
"max_unsafe_auto_id_timestamp": 1518163942643,
"file_sizes": {}
},
"translog": {
"operations": 77655904,
"size_in_bytes": 16659955310,
"uncommitted_operations": 33608,
"uncommitted_size_in_bytes": 37007332
},
"request_cache": {
"memory_size_in_bytes": 27728,
"evictions": 0,
"hit_count": 113956,
"miss_count": 173
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}

Thanks,
ize0j10

danielmitterdorfer · February 13, 2018, 7:06am

Thanks for the data. Your output looks fine to me and should indeed produce numbers. I will improve the error handling and logging in this area so we have more information what goes wrong here. I have raised https://github.com/elastic/rally/issues/416.

danielmitterdorfer · February 20, 2018, 9:30am

fyi, we've just released Rally 0.9.2 which adds a bit more logging in that area of the code.

system · March 20, 2018, 9:30am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Benchmarking results investigation Elasticsearch rally	4	581	June 22, 2018
Rally Track Report Analysis (follow up) Elasticsearch rally	18	2040	June 13, 2018
Benchmarking cluster with rally Elasticsearch rally	3	1209	August 23, 2021
ESRally Metrics Definition Elasticsearch rally	3	932	October 27, 2017
Benchmarking High Volumes Elasticsearch rally	2	508	May 11, 2019

Rally Track Report Analysis

Related topics