Cannot race, worker has exited prematurely

When benchmarking Elasticsearch with Rally, I keep getting errors like the following.

$ esrally race --track-path=. --pipeline=benchmark-only --target-hosts="https://10.43.34.12:9200" --client-options="basic_auth_user:'elastic',basic_auth_password:'changeme',use_ssl:true,verify_certs:false,timeout:60" --kill-running-processes

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Racing on track [input_json] and car ['external'] with version [8.14.1].

[WARNING] merges_total_time is 129079 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 95396 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 291269 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 23777 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[ERROR] Cannot race. Worker [2] has exited prematurely.

Getting further help:
*********************
* Check the log files in /home/vtokas/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/2.2.1/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/vtokas/.rally/logs.

--------------------------------
[INFO] FAILURE (took 14 seconds)
--------------------------------

This only seems to happen when I have a large number of workers, such as 64 or 128. One interesting observation is that this issue does not occur on older versions of Rally such as 1.4.1 or 2.0.3, but from 2.1.0 onwards I get this error when testing. Does anyone have any insight into why this may be the case?

Any advice on the topic would be appreciated. I am not sure how to attach rally.log, but I will attach it in a response.

Rally log here - rally log - Pastebin.com

Hello, have you tried with Rally version 2.11.0? There appears to be a fairly large number of workers indicated in the logs. Please try:

  • Rally 2.11.0
  • Cutting the number of workers in half via the available.cores system configuration (sketched below)
  • Increasing the client timeout from 60 to 240 seconds
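
A rough sketch of both changes, assuming the default configuration file at ~/.rally/rally.ini (the value 64 is only an illustration; pick whatever cap fits your track and hardware):

  # ~/.rally/rally.ini -- keep your existing settings and add this key to the [system] section
  [system]
  available.cores = 64

And the command from your first post with the larger timeout:

  $ esrally race --track-path=. --pipeline=benchmark-only --target-hosts="https://10.43.34.12:9200" --client-options="basic_auth_user:'elastic',basic_auth_password:'changeme',use_ssl:true,verify_certs:false,timeout:240" --kill-running-processes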

I am not sure whether any one of these suggestions will work; however, I am hoping it will get us closer to understanding the issue.

Thank you,
Jason

Hey Jason, thanks a lot for the response. I have tried Rally 2.11.0 as well, and it has the same issue with the same phenomenon: no reason for exiting in the logs. For 96 workers I was able to get the run to start 1 out of 5 times. Rally 2.0.3 is the last version that does not have this issue; every version thereafter that I have been able to test (including 2.11.0) gives me this error.
I agree that the number of workers is quite large, but current benchmarking indicates my cluster can handle the ingest load with this configuration, and I want to find out how far it can go.
As for your suggestions:

  • Reducing the number of workers did seem to work, and I stopped seeing any errors, but it also reduces the upload speed I am able to achieve, meaning I can't load test at higher bandwidth
  • Increasing the timeout did not make a difference; I got the same error with 240s

I would appreciate any help with this issue, as I have no idea of the cause, only that it happens with large worker counts (64 or 128).

Hi Varun, I see that you are using Python 3.9.19. Would you mind trying with the latest stable Python release?

Could you share spec details about the target Elasticsearch cluster? Specifically, I am interested in the number of nodes, the CPU cores and memory per node, the disk size and type, and the number of primary and replica shards for the target index.
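
If it is easier, something along these lines should pull most of that. This is just a sketch reusing the host and credentials from your first post; replace <index> with the name of the target index:

  $ curl -k -u elastic:changeme "https://10.43.34.12:9200/_cat/nodes?v&h=name,node.role,heap.max,ram.max,disk.total"
  $ curl -k -u elastic:changeme "https://10.43.34.12:9200/_nodes/os?filter_path=nodes.*.name,nodes.*.os.allocated_processors"
  $ curl -k -u elastic:changeme "https://10.43.34.12:9200/<index>/_settings?filter_path=*.settings.index.number_of_shards,*.settings.index.number_of_replicas"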

Thanks,
Jason

Hi Jason, apologies for the delayed response.

As for the Python version, I used Python 3.12 with esrally, and got the same behaviour.
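
For reference, the setup was along these lines (a minimal sketch; the venv path is only an example):

  $ python3.12 -m venv ~/esrally-py312
  $ source ~/esrally-py312/bin/activate
  $ pip install --upgrade pip
  $ pip install esrally
  $ esrally --version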

Spec details

  • 9 nodes, each node has all roles (no dedicated nodes for anything right now)
  • 192 CPU cores per node
  • 1TB RAM per node; I am currently using a 120GiB Java heap per node as I did not observe any performance degradation, although running with 30GiB gives the same error
  • Disks are network-mounted RAID SSDs, although this issue was similarly observed on spinning disks as well as local NVMe and SATA SSDs
  • 1 primary, no replicas

Here's the Rally log with Python 3.12, in case you need it: py3.12-esrally - Pastebin.com