Hi,
we're using Rally for performance evaluation. In our case, it's about the effect of the JVM on Elasticsearch's performance (disclaimer: I work for Azul).
We would like to use the track "elastic/logs" with the challenge "logging-indexing-querying", as, based on our experience, it represents quite a realistic scenario: customers constantly indexing new logs while running search queries in parallel. Which is exactly what the "logging-indexing-querying" challenge should be doing.
When we tried it, we ran into a few problems. Let me go through them one by one.
Problem #1: Missing latencies for the individual queries
When executing the track, here's the result we get:
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|------------------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 1.56297 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0.0449667 | min |
| Max cumulative indexing time across primary shards | | 0.838067 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0.1191 | min |
| Cumulative merge count of primary shards | | 5 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 0.0779833 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 0.143067 | min |
| Cumulative refresh count of primary shards | | 121 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.00335 | min |
| Max cumulative refresh time across primary shards | | 0.0556167 | min |
| Cumulative flush time of primary shards | | 0.00198333 | min |
| Cumulative flush count of primary shards | | 13 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 8.33333e-05 | min |
| Max cumulative flush time across primary shards | | 0.00075 | min |
| Store size | | 0.101498 | GB |
| Translog size | | 6.65896e-07 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 52 | |
| Total Ingest Pipeline count | | 400000 | |
| Total Ingest Pipeline time | | 69.393 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | insert-pipelines | 10.29 | ops/s |
| Mean Throughput | insert-pipelines | 10.29 | ops/s |
| Median Throughput | insert-pipelines | 10.29 | ops/s |
| Max Throughput | insert-pipelines | 10.29 | ops/s |
| 100th percentile latency | insert-pipelines | 1455.15 | ms |
| 100th percentile service time | insert-pipelines | 1455.15 | ms |
| error rate | insert-pipelines | 0 | % |
| Min Throughput | insert-ilm | 35.39 | ops/s |
| Mean Throughput | insert-ilm | 35.39 | ops/s |
| Median Throughput | insert-ilm | 35.39 | ops/s |
| Max Throughput | insert-ilm | 35.39 | ops/s |
| 100th percentile latency | insert-ilm | 26.9471 | ms |
| 100th percentile service time | insert-ilm | 26.9471 | ms |
| error rate | insert-ilm | 0 | % |
| error rate | discover/search | 0 | % |
| error rate | discover/visualize | 0 | % |
| error rate | kafka | 0 | % |
| error rate | nginx | 0 | % |
| error rate | apache | 0 | % |
| error rate | system/auth | 0 | % |
| error rate | system/syslog/dashboard | 0 | % |
| error rate | system/syslog/lens | 0 | % |
| error rate | mysql/dashboard | 0 | % |
| error rate | redis | 0 | % |
| error rate | mysql/lens | 0 | % |
| error rate | postgresql/overview | 0 | % |
| error rate | postgresql/duration | 0 | % |
| Min Throughput | bulk-index | 1197.09 | docs/s |
| Mean Throughput | bulk-index | 12379.2 | docs/s |
| Median Throughput | bulk-index | 13364.8 | docs/s |
| Max Throughput | bulk-index | 16276.7 | docs/s |
| 50th percentile latency | bulk-index | 285.791 | ms |
| 90th percentile latency | bulk-index | 827.314 | ms |
| 99th percentile latency | bulk-index | 1304.18 | ms |
| 100th percentile latency | bulk-index | 2621.23 | ms |
| 50th percentile service time | bulk-index | 285.791 | ms |
| 90th percentile service time | bulk-index | 827.314 | ms |
| 99th percentile service time | bulk-index | 1304.18 | ms |
| 100th percentile service time | bulk-index | 2621.23 | ms |
| error rate | bulk-index | 0 | % |
As far as we can tell, we only get error rates for the individual query tasks (nginx, kafka, etc.). However, we would also like to see the latencies of those queries (i.e. how long each query took to complete). Are we missing something? Is there a way to retrieve that information?
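In case it's relevant: one thing we tried (a sketch of our setup, with placeholder host/credentials) was configuring an Elasticsearch metrics store in `~/.rally/rally.ini`, hoping that the raw per-request latency samples for the query tasks would at least be queryable there, even if they don't show up in the summary report:

```ini
; ~/.rally/rally.ini — reporting section pointing Rally at an
; Elasticsearch metrics store (host and credentials are placeholders)
[reporting]
datastore.type = elasticsearch
datastore.host = metrics-store.example.com
datastore.port = 9200
datastore.secure = false
datastore.user = rally
datastore.password = changeme
```

If aggregated per-task latency percentiles for these query workflows are available some other way, we'd be happy to hear it.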
Problem #2: Length of the benchmark
Based on our observation, the benchmark lasts for as long as the indexing part takes to finish, and the search queries are executed during that time.
Would it be possible to specify the length based on time instead? Say, "run the indexing and the queries (repeating the set of queries one by one) for X seconds"?
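For illustration, in a custom track we could express this with Rally's `time-period` property on a `parallel` schedule element; a minimal sketch (the operation names and durations are placeholders, not the actual operations of this track):

```json
{
  "schedule": [
    {
      "parallel": {
        "time-period": 1800,
        "tasks": [
          { "operation": "bulk-index", "clients": 8 },
          { "operation": "search-queries", "clients": 2 }
        ]
      }
    }
  ]
}
```

Is there an equivalent knob (a track parameter, perhaps) for "logging-indexing-querying", or would we have to fork the track?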
Problem #3: Rate limiting
The scenario we want to evaluate the JVMs on is comparing the latencies of the search queries while indexing is in progress. To get an apples-to-apples comparison, we need to be able to fix the number of indexing requests, and possibly also the number of search requests, per unit of time.
We haven't found a way to achieve that. In other tracks, such as "sql", there is a "target-throughput" property that seems to do pretty much exactly what we want. Can it be used with "logging-indexing-querying" as well, or is it track-specific? Our experiments so far suggest the latter.
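For context, this is the kind of schedule we were hoping for; a sketch assuming a custom track, with `target-throughput` given in operations per second as in the Rally docs (so for `bulk-index` it would cap bulk requests, not documents; the task names are taken from the report above as placeholders):

```json
{
  "schedule": [
    {
      "parallel": {
        "tasks": [
          { "operation": "bulk-index", "clients": 8, "target-throughput": 100 },
          { "operation": "nginx", "clients": 1, "target-throughput": 0.5 }
        ]
      }
    }
  ]
}
```

If "logging-indexing-querying" exposes this via track parameters, a pointer to the right parameter names would be very welcome.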
Thank you very much for any comments; they are highly appreciated.
Jiri