Can't reach Rally Target throughput


(Sunil Pandith) #1

Hi,

I have a 4-node cluster and am benchmarking it with Rally. In "track.json" I set the number of "clients" for a search operation to 12, with a "target-throughput" of 2500. However, Rally reported only about 2000 operations per second against the target of 2500 req/sec. I increased the number of clients to 18, 20, 24, 30 and so on, but the reported throughput for the search never went beyond 2000 req/sec.
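For readers following along, the schedule entry described above might look roughly like the sketch below. It is written as a Python dict and serialized to JSON purely for illustration. The operation name "search" is a placeholder, not taken from the actual track, and the rest of the track file is omitted. As far as I understand Rally's track format, "target-throughput" is the total rate across all clients, not a per-client rate.

```python
import json

# Hypothetical schedule entry for a Rally track; only "clients" and
# "target-throughput" come from the post, the operation name is a placeholder.
schedule_entry = {
    "operation": "search",
    "clients": 12,
    "target-throughput": 2500,  # requests per second, summed over all clients
}

# Print the fragment as it would appear inside the schedule of track.json.
print(json.dumps(schedule_entry, indent=2))
```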

Is this due to a limitation of the server (co-ordinator), of Rally, or of the cluster setup?

I'm running the benchmark co-ordinator on an instance with a 4-core CPU and 16 GB of memory.

Thanks,
Sunil


(Christian Dahlqvist) #2

It is difficult to determine what is limiting throughput without seeing what the load is on the cluster and the Rally node. Is CPU saturated on the Rally node when you reach 2000 QPS? What does CPU usage on the Elasticsearch cluster look like? Depending on how much data each query returns, it could also be that network performance on the Rally node is a limiting factor.


(Sunil Pandith) #3

I have to check the CPU usage. In the meantime, can network bandwidth cause huge latency in the metrics? When I check the latency for a target throughput of 2000 req/sec it is around 4000 ms, but for 1000 req/sec it is 3-4 ms.


(Christian Dahlqvist) #4

Look at service time in the results and see if this varies with throughput. It is also possible that you are hitting the limit of what your cluster can handle. Do you have monitoring installed?


(Sunil Pandith) #5

Service time is not changing. It is always around 4-5 ms irrespective of the target throughput, but latency is huge (on the order of seconds).


(Daniel Mitterdorfer) #6

Hi,

I assume that you are not talking about the 50th percentile but rather the 100th percentile (i.e. the maximum). You should start with a target throughput of the inverse of that service time, so at most 250 ops/s (1 operation / 4 ms), lower that a bit to maybe 200 ops/s (1 operation / 5 ms), and gradually increase throughput from there. If latency is much higher than service time, that indicates the benchmark does not reach a steady state and the system in that specific setup cannot cope with the throughput you are aiming for. See also our FAQ for more info. I'd also suggest starting with a single client and then gradually increasing the number of clients.
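To make the relationship between service time, latency and target throughput concrete, here is a minimal sketch. It is not Rally's actual scheduler. It assumes a single client, a constant service time, and a simplified model in which latency is measured from the moment a request was scheduled to be sent, so it includes any time the request waits behind earlier ones. The function names are hypothetical.

```python
def max_throughput(clients: int, service_time_s: float) -> float:
    """Idealized ceiling in ops/s: each client finishes one request per service time."""
    return clients / service_time_s


def simulate_latencies(target_throughput: float, service_time_s: float,
                       num_requests: int) -> list[float]:
    """Simulate one client trying to keep up with a fixed request schedule.

    Request i is due at i / target_throughput. It can only start once the
    client has finished the previous request. Latency is completion time
    minus due time, so it grows whenever the client falls behind schedule.
    """
    interval = 1.0 / target_throughput
    client_free_at = 0.0
    latencies = []
    for i in range(num_requests):
        due = i * interval
        start = max(due, client_free_at)   # wait if the client is still busy
        end = start + service_time_s       # service time itself never changes
        latencies.append(end - due)
        client_free_at = end
    return latencies


if __name__ == "__main__":
    service_time = 0.005  # 5 ms, as reported in the thread
    print("ceiling for 1 client:", max_throughput(1, service_time), "ops/s")
    for target in (100, 200, 400):
        worst = max(simulate_latencies(target, service_time, 1000))
        print(f"target {target} ops/s -> max latency {worst * 1000:.1f} ms")
```

In this model, a target at or below the ceiling keeps latency equal to service time, while a target above it makes latency grow with every request even though service time stays constant. That matches the pattern reported earlier in the thread, although a real cluster's service time may also change with the number of concurrent clients.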

Daniel


(Dimitrios Liappis) #7

Wrt checking the load, network utilization and IO usage on your load generating machine, I recommend going over https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55 which I believe is based on Brendan Gregg's USE Method/Performance checklist.
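As a quick first check before working through the full checklist, something like the following sketch can be run on the load generator while the benchmark is active. It only looks at the 1-minute load average relative to the number of cores, using the Python standard library on a Unix-like system. The function name and the threshold of 1.0 are assumptions for illustration, not a rule from the linked article.

```python
import os


def load_per_core() -> float:
    """Return the 1-minute load average divided by the number of CPU cores."""
    load_1m, _load_5m, _load_15m = os.getloadavg()  # available on Unix only
    cores = os.cpu_count() or 1                     # fall back to 1 if unknown
    return load_1m / cores


if __name__ == "__main__":
    ratio = load_per_core()
    print(f"1-minute load per core: {ratio:.2f}")
    # Assumed rule of thumb: a sustained value near or above 1.0 suggests the
    # machine may be CPU-bound and worth a closer look with other tools.
    if ratio >= 1.0:
        print("The load generator itself may be a bottleneck.")
```

This says nothing about network or disk usage, which still need the dedicated tools from the checklist.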


(Sunil Pandith) #8

@danielmitterdorfer Sure, I will try that. Thanks :)

