Rally "default latency" performance degradation

env: Docker, 4 CPU, 16 GB memory, 8 GB heap
config file: processors: 4
Run the test:

while true; do
    esrally --pipeline=benchmark-only --target-hosts=${target_hosts} --cluster-health=skip --track=geonames --include-tasks ${action} --user-tag=${tag}
done

The first run's score is normal. From the second run onward it is not: service time is normal, but latency is not.
The other indicators are normal. Reference: [Alibaba Cloud Elasticsearch](https://help.aliyun.com/document_detail/62420.html).

Simulating the "default" challenge with ab is also abnormal:

ab -p payload.txt -T application/json -c 50 -n 500 http://es-1:9200/geonames/_search
Percentage of the requests served within a certain time (ms)
  50%    800
  66%    872
  75%    900
  80%    934
  90%   1090
  95%   1202
  98%   1319
  99%   1454
 100%   1626 (longest request)
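As a rough sanity check on these numbers, Little's law (L = λW) gives the throughput the system sustained during this ab run; the ~0.9 s mean is my own approximate read of the percentile table above:

```python
# Rough capacity estimate from the ab run above via Little's law: L = lambda * W
concurrency = 50        # ab -c 50: requests in flight at any moment (L)
mean_service_s = 0.9    # approximate mean response time from the table (W)

throughput = concurrency / mean_service_s  # lambda = L / W
print(f"rough sustained throughput at saturation: {throughput:.0f} ops/s")
# → rough sustained throughput at saturation: 56 ops/s
```

That is right at the edge of the 50 ops/s target mentioned later in the thread, which is consistent with the system running at saturation.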

After restarting the entire cluster, the result is normal again.

If I run only index-append and then the default challenge, Rally's "default latency" does not become abnormal so quickly.

How can I track down the cause of this problem?

I'd start by eliminating the Docker container as the cause and testing directly on the target machine. Besides, apart from the numbers I don't understand the linked text. :slight_smile: But I do see that you have tested a three-node cluster, and I assume that this means:

  • The load test driver is on a (physically) separate machine.
  • Each cluster node is on a dedicated (i.e. physically separate) machine.

You need to measure system metrics (like disk stats with iostat or behavior of the memory management subsystem with vmstat or similar tools) to find what causes the difference in performance. It would also make sense to have a look at Elasticsearch's GC logs to see whether the garbage collector is more active.

This is an indication that the issue is transient (i.e. I'd rule out disk issues).

Your test with ab is IMHO not that useful because it brings the system into saturation (see Relating Service Utilisation to Latency for more details on what I mean).

Having said that, you can see in your data that the target throughput in your Rally benchmark is also too high for your system (hence the high latency compared to the service time). Finally, note that the number ab reports is service time, not latency (see also the Rally FAQ).
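The distinction can be written down in a few lines. This is not Rally's actual code, just a sketch of the two definitions: service time is measured from when the request is actually sent, latency from when the throughput schedule says it *should* have been sent:

```python
# Sketch of the two metrics (definitions only, not Rally's implementation):
#   service time = completion - actual start of the request
#   latency      = completion - scheduled start (includes any time the
#                  request waited because the system fell behind schedule)

def metrics(scheduled_start, actual_start, completion):
    service_time = completion - actual_start
    latency = completion - scheduled_start
    return service_time, latency

# A request scheduled at t=0.0 that could only go out at t=1.5 (the
# system was still busy) and completed at t=2.4:
st, lat = metrics(0.0, 1.5, 2.4)
print(round(st, 3), round(lat, 3))  # → 0.9 2.4
```

Service time looks perfectly healthy (0.9 s) while latency (2.4 s) exposes the backlog, which is exactly the pattern reported in this thread.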

Thank you for your reply.

I will start a single-node test on the physical machine.

The linked page is for a similar Elastic cloud product; I used its official performance numbers as a reference.

  • The Rally container accesses each ES node container.
  • Each ES node runs on its own host with a locally mounted, dedicated SSD. The hosts have ample capacity (E5-2630 v4) and were prepared solely for this test.

I also think disk problems are unlikely, because indexing speed is normal and limited by CPU. The GC log only shows ordinary activity and does not prove anything.

I am a bit confused: shouldn't the numbers ab returns be latency, and the took field Elasticsearch returns be service time? Either way, the same test is much faster after restarting the cluster.

The default challenge is just

	"query": {
		"match_all": {}
	}

The highest target throughput is also only 50 ops/s, and the problem already appears at 30 ops/s: service time (the took field) is normal, but latency is very large. The other challenges seem to have no problem.

In addition, I also have a Docker environment with k8s + Ceph, and the problem is the same there.
I suspect:

  • a Docker problem
  • resource limits that are too low

I will continue to debug.

It should not be a network problem either:

qperf -t 60 tcp_bw tcp_lat
bw = 4.94 GB/sec
latency = 8.53 us

Starting a single-node ES directly on the host (config file: processors: 4) shows no problem, and running through all the challenges takes less time than with 3 nodes...

Now I am testing a single-node ES container (config file specifies processors: 4) without resource limits.

You are not the only one that is confused. The vast majority of load testing tools get latency wrong - and ab is one of them (nothing against ab, it is a fine tool but you need to be aware of the difference). What ab calls "latency" is actually "service time". And to be 100% precise, if you do not throttle throughput, then "latency" == "service time". But measuring query latency that way is wrong to begin with (for the very reason detailed in Relating Service Utilisation to Latency).

Long story short: If you want to compare the numbers you get from ab with the numbers you get from Rally, you need to compare ab's latency numbers to Rally's service time numbers. But then you also need to run a comparable benchmark with Rally, which means:

  • 50 concurrent clients
  • 500 iterations
  • No throughput throttling

But IMHO the more important point is to measure query latency correctly.

This is the same problem again. Let me explain that with an analogy:

Suppose you are a barista in a coffee shop. Let's assume it takes you one minute to prepare a coffee. This is the time your server is busy servicing a customer's request, and this is what we call service time. If fewer than one customer per minute enters your coffee shop, you are less than 100% utilized. Thus: every customer can order their coffee immediately (i.e. there is no waiting line).

Now suppose two customers per minute enter the coffee shop (2 ops/minute). The "problem" is that it still takes you one minute to prepare one coffee, i.e. customers inject too much load into the system (your maximum throughput is 1 op/minute). What happens? A waiting line builds up. And latency tells you exactly this: it takes the customers' waiting time into account. If you just look at the service time, everything is "fine": it still takes you one minute to prepare a coffee no matter how many customers enter. But as customers enter the coffee shop twice as fast as you can service them, the waiting line will grow and grow, and so will latency.

The takeaway is: you need to reduce the target throughput to a level that is sustainable for the system, i.e. latency and service time should be close. In your case I'd guess that this is somewhere between 20 and 25 ops/s (but you need to measure it).
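The coffee-shop example can be written as a tiny simulation (my own sketch, not Rally code): the barista needs 1 minute per coffee, but a customer walks in every 0.5 minutes, i.e. 2 ops/min offered to a 1 op/min system:

```python
service_time = 1.0        # minutes the barista is busy per coffee
arrival_interval = 0.5    # a new customer every 30 seconds (2 ops/min)

free_at = 0.0             # time at which the barista is next free
latencies = []
for i in range(10):
    arrival = i * arrival_interval
    start = max(arrival, free_at)     # stand in line if the barista is busy
    done = start + service_time
    free_at = done
    latencies.append(done - arrival)  # waiting time + service time

print(latencies)
# service time stays 1.0 for every customer, but latency climbs:
# [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
```

Every customer still gets exactly one minute of service, yet each successive customer waits half a minute longer than the previous one, which is why latency grows without bound while service time stays flat.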

These numbers look fine indeed.


What I actually want to ask is why the first test can reach the target value but the later tests cannot.

while true; do
    esrally --pipeline=benchmark-only --target-hosts=${target_hosts} --cluster-health=skip --track=geonames --include-tasks ${action} --user-tag=${tag}
done

For example: on the first day the cafe's maximum throughput is 50, but the next day it drops to 20, even though the load is the same 1500 every day. Did the barista get too tired without a rest? :D I think this drop is the real problem, not that my request rate needs to be reduced.

I tried adjusting the GC thread count; it seems to have no effect. The original GC thread count is based on the host's CPU core count.

I also raised an ES node's CPU to 8 cores; same problem.

You mentioned in previous posts:

This points to a transient problem (e.g. memory usage / memory fragmentation over time).

Did you check that the disk is not the problem, as I suggested (e.g. with iostat)? If it is not the disk, it could be, for example, memory related. Then you should check what the kernel is doing w.r.t. paging (e.g. with vmstat).

[screenshots: heap usage after restarting the cluster before running default, and after running default]

I suspect that after some challenge, problems occur in ES inside the resource-constrained Docker containers, which causes the subsequent tests to misbehave. I will run an exclusion test.

In the jstack output I cannot see anything different; I do not quite understand it.

Running only index-append, force-merge, index-stats, node-stats, default, and term is always normal.

