Esrally benchmark takes longer to run while the reported service time doesn't change

Hi,
I'm trying to run some tests to see how the number of indices affects Elasticsearch cluster performance.
I tested from 1k to 8k indices, and ran some random requests against those indices.
One weird thing I noticed is that as the number of indices increases, the challenges take longer to run (e.g. ~5 min with 1k indices, and about 30 min with 8k indices), but the service time barely changes at all.

I'm wondering what could be causing this discrepancy.

Here is what I see in the reports:

1k indices:


|                                                 Min Throughput | rand-sample-query |       31.31 |  ops/s |
|                                                Mean Throughput | rand-sample-query |       43.74 |  ops/s |
|                                              Median Throughput | rand-sample-query |       44.93 |  ops/s |
|                                                 Max Throughput | rand-sample-query |       46.22 |  ops/s |
|                                        50th percentile latency | rand-sample-query |     10.3159 |     ms |
|                                        90th percentile latency | rand-sample-query |     35.5788 |     ms |
|                                        99th percentile latency | rand-sample-query |     159.655 |     ms |
|                                      99.9th percentile latency | rand-sample-query |     1139.67 |     ms |
|                                     99.99th percentile latency | rand-sample-query |     1892.27 |     ms |
|                                       100th percentile latency | rand-sample-query |     2213.55 |     ms |
|                                   50th percentile service time | rand-sample-query |     10.3159 |     ms |
|                                   90th percentile service time | rand-sample-query |     35.5788 |     ms |
|                                   99th percentile service time | rand-sample-query |     159.655 |     ms |
|                                 99.9th percentile service time | rand-sample-query |     1139.67 |     ms |
|                                99.99th percentile service time | rand-sample-query |     1892.27 |     ms |
|                                  100th percentile service time | rand-sample-query |     2213.55 |     ms |
|                                                     error rate | rand-sample-query |        0.64 |      % |

[WARNING] Error rate is 0.64 for operation 'rand-sample-query'. Please check the logs.

---------------------------------
[INFO] SUCCESS (took 243 seconds)
---------------------------------

8k indices:

|                                                 Min Throughput | rand-sample-query |       38.71 |  ops/s |
|                                                Mean Throughput | rand-sample-query |       44.37 |  ops/s |
|                                              Median Throughput | rand-sample-query |       44.75 |  ops/s |
|                                                 Max Throughput | rand-sample-query |       48.49 |  ops/s |
|                                        50th percentile latency | rand-sample-query |     10.3017 |     ms |
|                                        90th percentile latency | rand-sample-query |     41.1989 |     ms |
|                                        99th percentile latency | rand-sample-query |     165.875 |     ms |
|                                      99.9th percentile latency | rand-sample-query |     1097.79 |     ms |
|                                     99.99th percentile latency | rand-sample-query |     1937.27 |     ms |
|                                       100th percentile latency | rand-sample-query |     2209.77 |     ms |
|                                   50th percentile service time | rand-sample-query |     10.3017 |     ms |
|                                   90th percentile service time | rand-sample-query |     41.1989 |     ms |
|                                   99th percentile service time | rand-sample-query |     165.875 |     ms |
|                                 99.9th percentile service time | rand-sample-query |     1097.79 |     ms |
|                                99.99th percentile service time | rand-sample-query |     1937.27 |     ms |
|                                  100th percentile service time | rand-sample-query |     2209.77 |     ms |
|                                                     error rate | rand-sample-query |         0.5 |      % |

[WARNING] Error rate is 0.5 for operation 'rand-sample-query'. Please check the logs.

----------------------------------
[INFO] SUCCESS (took 1729 seconds)
----------------------------------

@fatcloud I am a beginner here, but it would be helpful if you could add the track file and the commands, plus the infra configuration.

That will help in getting the question answered (by some expert).

@Sriram_Kumar
I checked the logs and found that the actual benchmark execution time is not long; most of the time is spent on
Killing running Rally processes
There are hundreds of lines like:

2023-02-03 07:01:47,851 -not-actor-/PID:41095 esrally.utils.process INFO Killing lingering process with PID [3467] and command line [[]].
2023-02-03 07:01:50,853 -not-actor-/PID:41095 esrally.utils.process INFO Killing lingering process with PID [3470] and command line [[]].

This process killing takes about 30 min before the benchmark really starts (the kills are logged roughly 3 s apart, so hundreds of them add up to about half an hour).

But I'm pretty sure I'm the only user on the test env, so I'm not sure what those processes are...
My command is
esrally race --kill-running-processes --challenge=query-update-parallel --pipeline=benchmark-only --target-hosts={host} --track-path=/root/esrally-benchmark/track.json --user-tag="my-test:poc" --client-options="{certs..}"

Hi @fatcloud 🙂 - thanks for your interest in Rally!

You say you're not sharing the environment with anyone else, but if Rally isn't able to shut down cleanly after each race, it can leave some lingering processes behind. The number of processes Rally spawns depends on the number of CPU cores it can see, so if you have a very large machine and a race doesn't shut down correctly (e.g. you hit CTRL+C too many times), it can take some time to terminate the leftover processes.
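
As an aside (going from memory here, so double-check the docs), you should be able to cap the number of worker processes Rally spawns by setting available.cores in the [system] section of rally.ini:

[system]
# limit Rally to 2 worker processes regardless of detected CPU cores
available.cores = 2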

That said, killing processes shouldn't take ~30 min. That does seem too long. If you can share the full log file + sample invocation we can take a look to see if there's a bug.

@Bradley_Deam
Thanks for the reply!
Here is the full log; it contains a complete run of a challenge, plus some lines from before this run.

Thanks for posting this. Based on the output it looks like you had 617 esrally processes, which is very odd.

Rally will create a separate Python process for each available core (available.cores), and each process can contain multiple 'clients' (i.e. the things that actually send requests to Elasticsearch).

In your instance it looks like the machine you're running esrally on has 8 cores, and you've specified 8 clients:

2023-02-07 20:30:25,401 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [0] on [localhost] with [1] clients.
2023-02-07 20:30:25,407 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [1] on [localhost] with [1] clients.
2023-02-07 20:30:25,412 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [2] on [localhost] with [1] clients.
2023-02-07 20:30:25,419 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [3] on [localhost] with [1] clients.
2023-02-07 20:30:25,424 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [4] on [localhost] with [1] clients.
2023-02-07 20:30:25,430 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [5] on [localhost] with [1] clients.
2023-02-07 20:30:25,436 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [6] on [localhost] with [1] clients.
2023-02-07 20:30:25,442 ActorAddr-(T|:41707)/PID:50567 esrally.driver.driver INFO Allocating worker [7] on [localhost] with [1] clients.

None of this explains why you had ~600+ esrally processes running, and stranger still, they don't even appear to have been created properly. When terminating processes, Rally will look for any process named esrally or rally, or any Python process with esrally or esrallyd in the command line.
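
Conceptually, that check is something like this (a rough psutil-based sketch of the idea, not the actual Rally code):

import psutil

def looks_like_rally(p):
    # Match by process name, or by a Python process whose command line
    # mentions esrally/esrallyd.
    if p.name() in ("esrally", "rally"):
        return True
    return p.name().lower().startswith("python") and any(
        "esrally" in arg or "esrallyd" in arg for arg in p.cmdline()
    )

def find_lingering_rally_processes():
    matches = []
    for p in psutil.process_iter():
        try:
            if looks_like_rally(p):
                matches.append(p)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return matches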

Your logs show this:

2023-02-07 19:59:20,570 -not-actor-/PID:50074 esrally.utils.process INFO Killing lingering process with PID [3466] and command line [[]].
2023-02-07 19:59:23,574 -not-actor-/PID:50074 esrally.utils.process INFO Killing lingering process with PID [3467] and command line [[]].
2023-02-07 19:59:26,578 -not-actor-/PID:50074 esrally.utils.process INFO Killing lingering process with PID [3470] and command line [[]].
2023-02-07 19:59:29,581 -not-actor-/PID:50074 esrally.utils.process INFO Killing lingering process with PID [3471] and command line [[]].
2023-02-07 19:59:32,585 -not-actor-/PID:50074 esrally.utils.process INFO Killing lingering process with PID [3505] and command line [[]].
[...]

However, I'd expect them to look like this (note that yours are missing the command line args):

2023-02-08 23:52:23,377 -not-actor-/PID:76605 esrally.rally INFO Killing running Rally processes
2023-02-08 23:52:23,395 -not-actor-/PID:76605 esrally.utils.process INFO Killing lingering process with PID [76422] and command line [['/home/brad/rally/test/bin/python', '/home/brad/rally/test/bin/esrally', 'race', '--track=geonames', '--on-error=abort', '--challenge=append-no-conflicts-index-only', '--pipeline=benchmark-only', '--target-hosts=localhost:9200', '--track-params=bulk_indexing_clients:32', "--client-options=use_ssl: true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'", '--kill-running-processes']].
2023-02-08 23:52:24,400 -not-actor-/PID:76605 esrally.utils.process INFO Killing lingering process with PID [76423] and command line [['/home/brad/rally/test/bin/python', '/home/brad/rally/test/bin/esrally', 'race', '--track=geonames', '--on-error=abort', '--challenge=append-no-conflicts-index-only', '--pipeline=benchmark-only', '--target-hosts=localhost:9200', '--track-params=bulk_indexing_clients:32', "--client-options=use_ssl: true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'", '--kill-running-processes']].

Can you share more about the environment (containerised?), the invocation (the exact esrally race [...] command), and anything else that might be relevant?

Can you reproduce this?

Hi @Bradley_Deam , sorry for the late reply.

The test env is a k8s pod, node info:

architecture=amd64
containerRuntimeVersion=docker://19.3.9
kernelVersion=4.15.0-2066-azure-fips
kubeProxyVersion=v1.20.15
kubeletVersion=v1.20.15
operatingSystem=linux
osImage=Ubuntu 18.04.6 LTS

Resources:

resources:
  limits:
    cpu: '4'
    memory: 16Gi
  requests:
    cpu: '4'
    memory: 16Gi

Yes, I can reproduce it; actually this happens every time.
I think the number of processes is related to the number of indices. In older logs there were only a few INFO Killing lingering process with ... lines when there were only a few thousand indices, and the count kept growing as I added more indices.

The command I'm using is esrally race --kill-running-processes --challenge=query-update-parallel --pipeline=benchmark-only --target-hosts={host} --track-path=/root/esrally-benchmark/track.json --user-tag="my-test:poc" --client-options="{certs..}"

The way I'm generating the test search/update requests may be a bit different. I want to send each request to a random index; all index names look like {index_version}~{entity_id}, for example v1~123. Each time, I build 1k indices with the same data (the same set of "entity_id" values but a different "index_version"), then I randomly pick an "index_version" when generating a test request. I do this in track.py:

(simplified code)

import random

SAMPLED_QUERIES_PATH = "/root/all-queries.txt"
SAMPLED_DOC_IDS_PATH = "/root/doc-ids.txt"
# Put all index versions here.
INDEX_VERSIONS = ["v1", "v2", "v3"]

class QuerySource:
    def __init__(self, track, params, **kwargs):
        self.sampled_queries = []
        # Load all sampled queries into a list.
        # ... ... code that reads the file

    def partition(self, partition_index, total_partitions):
        return self

    def params(self):
        rand_query = random.choice(self.sampled_queries)
        rand_index_version = random.choice(INDEX_VERSIONS)
        return {
            "body": rand_query.query,
            "index": get_index_name(rand_index_version, rand_query.entity_id),
        }


class UpdateByQueryRunner:
    def __init__(self):
        # Load all sampled doc ids into a list.
        self.sampled_doc_ids = []
        # ... ... code that reads the file

        self.query = {
            "script": "if (ctx._source.lastEditedAt == null) { ctx._source.lastEditedAt = 0;} "
                      "else { ctx._source.lastEditedAt++; }"
        }

    async def __call__(self, es, params):
        rand_doc = random.choice(self.sampled_doc_ids)
        rand_index_version = random.choice(INDEX_VERSIONS)
        index = get_index_name(rand_index_version, rand_doc.entity_id)
        await es.update(id=rand_doc.doc_id, body=self.query, index=index, retry_on_conflict=10)

    def __repr__(self, *args, **kwargs):
        return "update-by-query-runner"

def get_index_name(index_version, entity_id):
    return "{}~{}".format(index_version, entity_id)


def register(registry):
    registry.register_param_source("query-source", QuerySource)
    registry.register_runner("rand-update-by-query", UpdateByQueryRunner(), async_runner=True)
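
In track.json, these registered names are wired up roughly like this (simplified, not the exact file; the param source feeds the search operation, and the custom runner is referenced by its registered name as the operation type):

"operations": [
  {
    "name": "rand-sample-query",
    "operation-type": "search",
    "param-source": "query-source"
  },
  {
    "name": "rand-update-by-query",
    "operation-type": "rand-update-by-query"
  }
]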

I'm not sure if this could be causing some issues...

Actually, I found that after I reboot the test env, those process kills are gone.

I think it's the number of runs: the more times I run the benchmark, the more processes there are to kill.
Once I reboot the test env, it goes back to normal.
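
A quick sanity check I run between benchmarks to count anything Rally-like that's still alive is something like:

pgrep -fa rally | wc -l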


Thanks for the details, this is very odd!

I overlooked the version you're using here (2023-02-07 19:59:20,514 -not-actor-/PID:50074 esrally.rally INFO Rally version [2.2.1]); can you please try to reproduce this on the latest release (2.7.0)? There have been quite a few bugs/issues fixed since 2.2.1 (June 2021).
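
If you installed via pip, upgrading should just be something like:

pip install --upgrade esrally==2.7.0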

If you're able to reproduce this on 2.7.0, can you please post (or PM me privately) the track in full (including the corpora)? A tarball would be great.

Oh, thanks for following up.

If you're using a containerised deployment of Rally, I'm not sure how previous processes remain lingering after each termination. That implies the container is being re-used and that each shutdown is not 'clean'.
