The number of clients in search operation

When I set the number of clients to 10, Rally runs correctly, but when I set the number of clients to 30:

  "schedule": [
{
  "operation": {
"name":"N clients randomly query 10,000 data",
"operation-type": "search",
    "param-source": "randomly-query"
  },
  "warmup-iterations": 20,
  "iterations": 500,
  "target-throughput": 200,
  "clients":30
}
]

Rally fails with the following error:

Running N clients randomly query 10,000 data                                   [ 17% done][WARNING] Could not 
terminate all internal processes within timeout. Please check and force-terminate all Rally processes.
[ERROR] Cannot race. Load generator [%d] has exited prematurely.

Is this error caused by the number of clients? If so, what is the number of clients typically set to?

Here is the log:

2018-08-07 07:08:15,746 ActorAddr-(T|:32901)/PID:2666 esrally.actor INFO LoadGenerator[12] is exiting due to ActorExitRequest.
2018-08-07 07:08:15,797 ActorAddr-(T|:32901)/PID:2666 esrally.driver.driver INFO User cancelled execution.
2018-08-07 07:08:16,649 ActorAddr-(T|:38719)/PID:2652 esrally.actor INFO Load generator [12] has exited.
2018-08-07 07:08:16,664 -not-actor-/PID:2641 esrally.rally WARNING Shutdown timed out. Actor system is still running.
2018-08-07 07:08:16,668 -not-actor-/PID:2641 esrally.rally ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/rally.py", line 454, in dispatch_sub_command
    race(cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/rally.py", line 383, in race
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/rally.py", line 404, in with_actor_system
    runnable(cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/rally.py", line 383, in <lambda>
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 383, in run
    raise e
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 380, in run
    pipeline(cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 61, in __call__
    self.target(cfg)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 327, in benchmark_only
    return race(cfg, external=True)
  File "/home/dr/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 279, in race
    raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: ('Load generator [%d] has exited prematurely.', None)

It appears that one of the clients got stuck; since you are using a custom parameter source this is rather hard to debug without having access to the implementation of your parameter source. Did you also check Elasticsearch and its logs after this happened?

What does your partition() method look like in your parameter source code? Are you just returning self or is your code doing different things per client?
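
For illustration, here is a sketch of the distinction I'm asking about (this is not your code; the class name and the slicing logic are made up, only the partition()/size()/params() contract is Rally's). A parameter source that does different things per client would typically return a per-client copy from partition() rather than plain self:

import copy
import random


class PartitionedQuerySource:
    # Hypothetical example for illustration only.
    def __init__(self, track, params, **kwargs):
        self._params = params
        self.bodies = []  # in practice, loaded from a JSON file in the constructor

    def partition(self, partition_index, total_partitions):
        # Returning a modified copy gives every client its own slice of the
        # query bodies; returning plain `self` means all clients share one source.
        p = copy.copy(self)
        p.bodies = self.bodies[partition_index::total_partitions]
        return p

    def size(self):
        return 1

    def params(self):
        return {
            "body": random.choice(self.bodies),
            "index": "my-index",  # placeholder index name
            "cache": False
        }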

Thank you for your reply.
I plan to have each client send a different search body on every iteration. This is my track.py:

import random
import os
import json


class randomly_query_body:
    def __init__(self, track, params, **kwargs):
        # we can eagerly resolve the search bodies already in the constructor...
        cwd = os.path.dirname(__file__)
        with open(os.path.join(cwd, "search_body.json"), "r") as ins:
            self.terms = json.load(ins)
        # ... while the remaining parameters are kept for later use
        self._params = params

    def partition(self, partition_index, total_partitions):
        return self

    def size(self):
        return 1

    def params(self):
        rand_body = random.choice(self.terms)
        return {
            'body': rand_body,
            'index': 'image_v2',
            'type': 'fp',
            'cache': False,
            'use_request_cache': False
        }


def register(registry):
    registry.register_param_source("randomly-query", randomly_query_body)

Thanks for sharing this. I don't see anything standing out in the parameter source code (I assume that your search bodies are stored in a top-level array in search_body.json).
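
For reference, the kind of search_body.json I'm assuming here is just a top-level JSON array of query bodies, something like this (queries and field names made up for illustration):

[
  { "query": { "match_all": {} } },
  { "query": { "term": { "some_field": "some_value" } } }
]

With that layout, json.load returns a Python list and random.choice picks one body per invocation.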

Did you spot anything in the Elasticsearch logs?

Sorry, I have no access to the Elasticsearch logs.
Is this problem caused by too many clients?

Two more questions:

  1. In general, how many requests per second can each client make when benchmarking? If it's a big number, does that mean we don't need to run a lot of clients when benchmarking ES?
  2. When performing the search operation, the progress sometimes drops, e.g. from 50% to 46%. Why is that?

Thanks.

Trying to answer both previous questions here.

It's a bit unusual to benchmark Elasticsearch without access to its logs; keep in mind that if it's a production system, you could cause a service outage (similar to a DoS attack) by aggressively issuing queries from a large number of clients.

When you initially configured target-throughput:200 and clients:10, each client tried to issue 20 requests (queries) per second. However, request-response is blocking per client, so if your cluster is struggling to cope with the queries, it's possible this rate wasn't achieved. You can also see this as a difference between service_time and latency [1].
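
As a quick back-of-the-envelope sketch of that arithmetic (plain Python, nothing Rally-specific):

# target-throughput is split evenly across the clients of an operation
target_throughput = 200        # ops/s for the whole operation
clients = 10
per_client_rate = target_throughput / clients   # 20 requests per second per client
per_client_interval = 1 / per_client_rate       # i.e. one request scheduled every 50 ms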

Additionally, the default client timeout is 60s (docs here), so it's possible that when you increased the number of clients from 10 to 30, the additional queries (since there are more clients that aren't blocked by slow responses) made the ES cluster even slower to execute them. At any rate, just increasing the number of clients while your cluster is getting progressively slower is not going to add any value to your benchmarks and will likely cause nodes to die.

Re: 2, where you observed the progress going from high to low: this is odd and could be related to a situation where nodes or the whole cluster are dying.

[1] The FAQ section contains a good explanation of service_time vs latency.

1. When I initially configured target-throughput:20 and clients:1, I received the following report:
   max throughput: 18.6 ops/s, 99th percentile latency: 2391 ms, 99th percentile service time: 97 ms
2. When I initially configured target-throughput:100 and clients:10, I received the following report:
   max throughput: 99 ops/s, 99th percentile latency: 395 ms, 99th percentile service time: 362 ms

From test 2, we can see the cluster can reach a throughput of 100; I wonder why the 99th percentile latency in test 1 is so large (2391 ms) even though the target-throughput is below 100?

In my case, I think the throughput of my cluster should be relatively high, but from the test results I find the throughput is low. So I wonder if the load generated by Rally is too low; I want to know the maximum number of clients for the search operation, and the maximum requests per second per client.

Thank you very much !

Some inline comments below. Note that without seeing the entire Rally report it's not easy to understand the whole picture.
Especially for service_time vs latency, the FAQ mentioned earlier is an essential read to understand the differences.

When I initially configured target-throughput:20 and clients:1, I received the following report:
max throughput: 18.6 ops/s, 99th percentile latency: 2391 ms, 99th percentile service time: 97 ms

max-throughput not reaching the set target-throughput, together with the very high 99th percentile latency compared to the 99th percentile service time, indicates that the benchmark is unstable and the numbers are not very representative. Despite this, I have a few additional, more detailed comments:

With clients:1 you need to remember what I mentioned earlier, i.e. request-response is blocking.

So if your cluster is slow executing the queries, you can't achieve the configured target-throughput, and that is what happened here: the max throughput achieved was 18.6 ops/s, below the target-throughput of 20.
Median throughput would also be useful to check.

The 99th percentile service time (service_time is the time between issuing a request and receiving the response) tells us that 99% of requests took <= 97 ms to return.
Keep in mind that the interpretation of these numbers depends on the number of iterations; e.g. with 500 iterations, the 99th percentile means that 5 requests ended up with a service time above 97 ms. It would be useful to see the other percentiles here, e.g. the 50th, to see how slow a larger share of operations were.
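
Purely to illustrate that arithmetic (hypothetical service times, not your data):

import random

# With 500 iterations, roughly 1% of requests, i.e. about 5, end up above the
# 99th percentile service time.
service_times = sorted(random.gauss(60, 15) for _ in range(500))  # hypothetical values in ms
p99 = service_times[int(round(0.99 * len(service_times))) - 1]
above_p99 = sum(1 for t in service_times if t > p99)
print(f"99th percentile: {p99:.1f} ms, requests above it: {above_p99}")  # ~5 of 500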

For latency it's best to read the great barista and coffee shop analogy mentioned by @danielmitterdorfer here.
TL;DR: if requests constantly take longer to execute than the calculated schedule allows, the delay piles up and can lead to high latency values [1].

When I initially configured target-throughput:100 and clients:10, I received the following report:
max throughput: 99 ops/s, 99th percentile latency: 395 ms, 99th percentile service time: 362 ms

Here you have more clients issuing parallel requests despite the request-response delay, and you can see that increasing your clients and target-throughput also increased the 99th percentile service time almost 4x, to 362 ms.

This shows that the more queries you send, the slower your cluster is able to service them.
Note that issuing search requests is a lightweight operation for Rally itself (there isn't much expensive IO in the background, unlike with bulk, for example).

From test 2, we can see the cluster can reach a throughput of 100; I wonder why the 99th percentile latency in test 1 is so large (2391 ms) even though the target-throughput is below 100?

In my case, I think the throughput of my cluster should be relatively high, but from the test results I find the throughput is low. So I wonder if the load generated by Rally is too low; I want to know the maximum number of clients for the search operation, and the maximum requests per second per client.

The number of clients depends on what you want to achieve; e.g. if you are benchmarking a 300ms SLA for your query responses under a load of 100 queries/s, you'll need at least 30 clients. Rally will schedule load among its clients to achieve the target-throughput and not more.
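
A rough back-of-the-envelope calculation for that example (simple arithmetic, not a Rally API):

import math

# With blocking request-response, one client can sustain at most
# 1 / response_time requests per second.
target_throughput = 100           # queries per second
response_time_budget = 0.300      # 300 ms SLA per query, in seconds
min_clients = math.ceil(target_throughput * response_time_budget)
print(min_clients)                # 30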

I would also recommend that you enable an Elasticsearch metrics store so that you have better visibility into each request. For every search operation, Rally will then store metric records for latency, service_time and throughput, which help you understand what's happening over time much better than the summary at the end.
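
Roughly, that means pointing the [reporting] section of ~/.rally/rally.ini at an Elasticsearch cluster. The exact keys can differ between Rally versions, so please double-check the configuration docs for your version; it looks approximately like this (host name is a placeholder):

[reporting]
datastore.type = elasticsearch
datastore.host = metrics-cluster.example.com
datastore.port = 9200
datastore.secure = false
datastore.user =
datastore.password =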

[1] With clients:1 and target-throughput:20, Rally will ensure that requests are only issued every 1/20 s = 50 ms. If a large share of operations takes > 50 ms to return, the delay piles up and inflates latency.
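
A tiny toy model of that footnote (not Rally internals, just the scheduling math) shows how service times above the 50 ms interval inflate latency for a single blocking client:

# One client, one request scheduled every 50 ms (target-throughput of 20 ops/s).
schedule_interval = 0.050
service_times = [0.040, 0.060, 0.080, 0.070, 0.030]  # hypothetical, in seconds

now = 0.0
for i, service_time in enumerate(service_times):
    scheduled_start = i * schedule_interval
    actual_start = max(now, scheduled_start)   # blocked until the previous response arrives
    now = actual_start + service_time
    latency = now - scheduled_start            # service_time plus any accumulated delay
    print(f"request {i}: service_time={service_time * 1000:.0f} ms, latency={latency * 1000:.0f} ms")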

Thank you for your patience. I have got it.

