Possible reason for > 30% difference between iterations of geonames -> painless_static

While comparing the performance of dockerized Elasticsearch with Elasticsearch running as a native service on the host, I discovered large fluctuations between tests - when rerunning the same test set!

First, I compared the overall time of the runs:

So we can see about 4 minutes difference in overall time.
I dug deeper to see which operation is causing this:

There are a few operations that are slower, but painless_static is the worst offender.


Above we see throughput, latency and service time. The second run is much slower.

But I am not able to find anything suspicious in the Metricbeat dashboards:


See 12:24 - 12:28 and compare with 13:02 - 13:09.

Here is my testing process:

  • delete the old geonames index via Kibana
  • stop Elasticsearch
  • stop Kibana
  • start Elasticsearch
  • start Kibana
  • wait until Elasticsearch is up
  • run esrally with an external car, track geonames, challenge default (roughly scripted below)
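
Roughly scripted, one iteration looks like this. This is a simplified sketch of the native-service case: the host, the systemd unit names, and the exact esrally flags are assumptions and differ slightly from my real setup (for example, I delete the index in Kibana dev tools by hand).

```python
#!/usr/bin/env python3
"""Simplified sketch of one benchmark iteration (native-service case).

Hostnames, service names and the exact esrally flags are illustrative
assumptions; the real runs differ in details.
"""
import json
import subprocess
import time
import urllib.request

ES = "http://localhost:9200"


def wait_for_elasticsearch(timeout=180):
    """Poll the cluster health API until Elasticsearch answers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{ES}/_cluster/health") as resp:
                if json.load(resp)["status"] in ("yellow", "green"):
                    return
        except OSError:
            pass  # not up yet, keep polling
        time.sleep(2)
    raise RuntimeError("Elasticsearch did not come up in time")


# delete the old geonames index (equivalent to doing it in Kibana dev tools)
subprocess.run(["curl", "-s", "-X", "DELETE", f"{ES}/geonames"], check=False)

# restart the stack
subprocess.run(["sudo", "systemctl", "stop", "elasticsearch"], check=True)
subprocess.run(["sudo", "systemctl", "stop", "kibana"], check=True)
subprocess.run(["sudo", "systemctl", "start", "elasticsearch"], check=True)
subprocess.run(["sudo", "systemctl", "start", "kibana"], check=True)

wait_for_elasticsearch()

# benchmark the externally provisioned cluster (recent Rally syntax;
# the geonames track's default challenge is used when none is given)
subprocess.run([
    "esrally", "race",
    "--track=geonames",
    "--pipeline=benchmark-only",
    "--target-hosts=localhost:9200",
], check=True)
```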

Nothing else is running on the server where Elasticsearch is tested (except for Metricbeat).
esrally runs on a system shared with an Elastic dev system with very low load, but I also stopped that dev system in previous runs, and the benchmarks showed the same fluctuation.

Any help is really appreciated.
Thanks a lot, Andreas

Hi Andreas,

To understand what's happening, I suggest watching my talk The Seven Deadly Sins of Elasticsearch Benchmarking (free to watch but requires prior registration). Please check "sin 3", which covers your question extensively. See also the related blog post Seven Tips for Better Elasticsearch Benchmarks, which is a summary of the talk.

Daniel

Hey Daniel, thanks for the reply.

Your talk was interesting. It's clear to me that latency will go up if we query faster than the system can respond, because of the growing queue.
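
Just to write my understanding down (the notation is mine, not from the talk): with lambda as the target arrival rate and mu as the rate the cluster can actually sustain, in the overloaded case the backlog and the waiting time of later requests grow roughly linearly over the run:

```latex
% Overload intuition: \lambda = target arrival rate, \mu = sustainable rate,
% \lambda > \mu. Backlog and waiting time grow linearly with elapsed time t:
\[
  Q(t) \approx (\lambda - \mu)\,t,
  \qquad
  W(t) \approx \frac{Q(t)}{\mu} = \frac{(\lambda - \mu)\,t}{\mu}
\]
```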

What is not completely clear to me is why the service time varies that much. Do you think it is caused by the overload? And if I lower the target throughput, should the values become more stable?

Regards, Andreas

The talk (and in fact Rally as well) makes a simplifying assumption, namely that the benchmarked system can be modelled with a single queue (known as an M/M/1 queue in queuing theory). In practice, systems have several queues: incoming network packets can queue up at the OS level, runnable processes queue up in the CPU scheduler's run queue, Elasticsearch has a queue in front of its thread pool, and if multiple Elasticsearch nodes are processing a query, even more queues are involved. So service time is only an approximation (although the best one we can get from a client's perspective), and that would explain why you see a varying service time.
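
For reference, the textbook M/M/1 results show how sensitive these numbers become near saturation. This is the idealised single-queue model, not a statement about Elasticsearch internals; the concrete rates below are just an illustration:

```latex
% M/M/1 steady state: arrival rate \lambda, service rate \mu,
% utilisation \rho = \lambda/\mu < 1.
\[
  W = \frac{1}{\mu - \lambda} \quad \text{(mean time in system)},
  \qquad
  W_q = \frac{\rho}{\mu - \lambda} \quad \text{(mean time spent queueing)}
\]
% Example: \mu = 100 ops/s (10 ms pure service time) and \lambda = 95 ops/s
% give W = 200 ms, of which 190 ms is queueing delay; close to saturation,
% small shifts in load produce large swings in what the client measures.
```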

As a corollary of my previous reasoning, this could indeed be the case and would be worth testing.

Daniel
