Remote Cluster Benchmark test run fails to start

ES cluster version 6.5.3.
When trying to run benchmarking remote cluster with http_logs, it fails with error:
Command issued, error message, rally.ini and error logs are provided below.

Thanks

Command to run benchmark:
esrally --distribution-version=6.5.3 --pipeline=benchmark-only --track=http_logs --target-hosts=10.236.133.163:9200,10.236.133.164:9200 --load-driver-hosts=10.236.133.84

Error:
[INFO] Racing on track [http_logs], challenge [append-no-conflicts] and car ['external'] with version [6.5.3].

[ERROR] Cannot race. Fatal track or load generator indication
Child Aborted

rally.ini file:
[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = /root/.rally/benchmarks
src.root.dir = /root/.rally/benchmarks/src

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch

[benchmarks]
local.dataset.cache = /root/.rally/benchmarks/data

[reporting]
datastore.type = in-memory
datastore.host =
datastore.port =
datastore.secure =
datastore.user =
datastore.password =

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.cache = true

From rally.log:

2018-12-17 21:04:33,590 ActorAddr-(T|:40404)/PID:29106 esrally.actor INFO MechanicActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:41610)])
2018-12-17 21:04:33,608 ActorAddr-(T|:41610)/PID:29045 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:42707)] (ignoring).
2018-12-17 21:04:33,599 ActorAddr-(T|:40404)/PID:29106 esrally.actor ERROR Child actor exited with [ChildActorExited:ActorAddr-(T|:41758)] while in status [benchmark_started].
2018-12-17 21:04:33,615 ActorAddr-(T|:41610)/PID:29045 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:40404)] (ignoring).
2018-12-17 21:04:33,623 -not-actor-/PID:29044 root INFO ActorSystem Logging Shutdown
2018-12-17 21:04:33,644 -not-actor-/PID:29043 root INFO ---- Actor System shutdown
2018-12-17 21:04:33,645 -not-actor-/PID:29037 esrally.rally INFO Actor system is still running. Waiting...
2018-12-17 21:04:34,648 -not-actor-/PID:29037 esrally.rally INFO Shutdown completed.
2018-12-17 21:04:34,648 -not-actor-/PID:29037 esrally.rally ERROR Cannot run subcommand [race].
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/esrally/rally.py", line 454, in dispatch_sub_command
race(cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/rally.py", line 383, in race
with_actor_system(lambda c: racecontrol.run(c), cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/rally.py", line 404, in with_actor_system
runnable(cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/rally.py", line 383, in
with_actor_system(lambda c: racecontrol.run(c), cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/racecontrol.py", line 390, in run
raise e
File "/usr/local/lib/python3.5/dist-packages/esrally/racecontrol.py", line 387, in run
pipeline(cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/racecontrol.py", line 61, in call
self.target(cfg)
File "/usr/local/lib/python3.5/dist-packages/esrally/racecontrol.py", line 334, in benchmark_only
return race(cfg, external=True)
File "/usr/local/lib/python3.5/dist-packages/esrally/racecontrol.py", line 286, in race
raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: ('Fatal track or load generator indication', 'Child Aborted')

Hello,

Firstly, please don't post unformatted code as it's very hard to read.

Instead paste the text and format it with </> icon, and check the preview
window to make sure it's properly formatted before posting it. This makes it
more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

Now from what I could discern from the above text, your issue here is that you are using a wrong pipeline. As per the docs:

benchmark-only Assumes an already running Elasticsearch instance, runs a benchmark and reports results

However, you are also specifying --distribution-version which implies from-distribution pipeline i.e. asking Rally to launch ES 6.5.3 for you. Since you seem to be specifying IP addresses in target-hosts I presume your Elasticsearch cluster is already there and present so there's no need for Rally to spin one up for you, in which case don't use --distribution-version. I've noticed a couple of users mixing those up before, so we'll look at adding some more checks in Rally to bail out if --distribution-version is used together with an incompatible pipeline.

Additionally, I wonder about your use of --load-driver-hosts; are you running >1 load drivers (see: https://esrally.readthedocs.io/en/stable/recipes.html#recipe-distributed-load-driver) and using a separate machine as the coordinator? If yes, all machines need to run esrallyd with proper parameters (see the previous link); however, as you seem to be only one IP address in --load-driver-hosts so if this is the same IP as the machine you executed the esrally command on, you don't need it at all.

Finally a small good practice is that you shouldn't be running Rally (or any other use process) as root.

Here is updated info:

ES 6.5.3 provisioned remote cluster exists on target_hosts(data nodes) in green status and nodes are reachable on port 9200 from rally nodes with no data. Have multiple load driver hosts but for debugging purposes removed others from command . For this setup it has one coordinator node on ip 80 and load driver on IP 81. esrally version is 1.0.2 and OS is Ubuntu 16.04 on both nodes. esrallyd demons started on coordinator and load driver with these commands

On Coordinator Node:
esrallyd start --node-ip=10.236.133.80 --coordinator-ip=10.236.133.80
local@rally00:/rally$ esrallyd start --node-ip=10.236.133.80 --coordinator-ip=10.236.133.80
[INFO] Successfully started actor system on node [10.236.133.80] with coordinator node IP [10.236.133.80].

On load driver

esrallyd start --node-ip=10.236.133.81 --coordinator-ip=10.236.133.80
INFO] Successfully started actor system on node [10.236.133.81] with coordinator node IP [10.236.133.80]

Command Issued on coordinator node: (from documentation section Tips & Tricks for remote cluster distributed load)
esrally --pipeline=benchmark-only --track=http_logs --target-hosts=10.236.133.163:9200,10.236.133.164:9200 --load-driver-hosts=10.236.133.81

Configuration:

Rally Config:
local@rally00://home/local/.rally$ cat rally.ini
[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = /rally
src.root.dir = /rally/src

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch

[benchmarks]
local.dataset.cache = /rally/data

[reporting]
datastore.type = in-memory
datastore.host =
datastore.port =
datastore.secure =
datastore.user =
datastore.password =

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.cache = true

Error reported in Log file:

2018-12-18 16:30:04,603 -not-actor-/PID:9895 esrally.rally INFO OS [posix.uname_result(sysname='Linux', nodename='rally00', release='4.4.0-140-generic', version='#166-Ubuntu SMP Wed Nov 14 20:09:47 UTC 2018', machine='x86_64')]
2018-12-18 16:30:04,604 -not-actor-/PID:9895 esrally.rally INFO Python [namespace(_multiarch='x86_64-linux-gnu', cache_tag='cpython-35', hexversion=50660080, name='cpython', version=sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0))]
2018-12-18 16:30:04,604 -not-actor-/PID:9895 esrally.rally INFO Rally version [1.0.2]
2018-12-18 16:30:04,604 -not-actor-/PID:9895 esrally.rally INFO Command line arguments: Namespace(advanced_config=False, assume_defaults=False, car='defaults', car_params='', challenge=None, client_options='timeout:60', configuration_name=None, distribution_repository='release', distribution_version='', effective_start_date=None, elasticsearch_plugins='', enable_driver_profiling=False, include_tasks=None, keep_cluster_running=False, laps=1, load_driver_hosts='10.236.133.81', offline=False, on_error='continue', pipeline='benchmark-only', plugin_params='', preserve_install='False', quiet=False, report_file='', report_format='markdown', revision='current', runtime_jdk=None, show_in_report='available', subcommand=None, target_hosts='10.236.133.163:9200,10.236.133.164:9200', team_path=None, team_repository='default', telemetry='', telemetry_params='', test_mode=False, track='http_logs', track_params='', track_path=None, track_repository='default', user_tag='')
........  
    2018-12-18 16:30:07,684 ActorAddr-(T|:40530)/PID:9900 esrally.metrics INFO Creating file race store
    2018-12-18 16:30:07,684 ActorAddr-(T|:40530)/PID:9900 esrally.actor INFO Asking mechanic to start the engine.
    2018-12-18 16:30:07,684 ActorAddr-(T|:40530)/PID:9900 esrally.actor INFO Capabilities [{'Thespian ActorSystem Version': 2, 'Python Version': (3, 5, 2, 'final', 0), 'coordinator': True, 'Thespian Version': '1545149308282', 'Thespian Watch Supported': True, 'Convention Address.IPv4': '10.236.133.80:1900', 'Thespian Generation': (3, 9), 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.236.133.80'}] match requirements [{'coordinator': True}].
    2018-12-18 16:30:07,789 ActorAddr-(T|:33967)/PID:9960 esrally.client INFO Creating ES client connected to [{'port': 9200, 'host': '10.236.133.163'}, {'port': 9200, 'host': '10.236.133.164'}] with options [{'retry-on-timeout': True, 'timeout': 60}]
    2018-12-18 16:30:07,716 ActorAddr-(T|:46768)/PID:9961 esrally.actor INFO Connecting to externally provisioned nodes on [[{'port': 9200, 'host': '10.236.133.163'}, {'port': 9200, 'host': '10.236.133.164'}]].
    .......
    2018-12-18 16:30:07,887 -not-actor-/PID:9895 esrally.racecontrol ERROR A benchmark failure has occurred
    2018-12-18 16:30:07,887 -not-actor-/PID:9895 esrally.racecontrol INFO Telling benchmark actor to exit.
    2018-12-18 16:30:07,888 -not-actor-/PID:9895 esrally.rally ERROR Cannot run subcommand [race].
    Traceback (most recent call last):
      File "/home/local/.local/lib/python3.5/site-packages/esrally/rally.py", line 454, in dispatch_sub_command
        race(cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/rally.py", line 383, in race
        with_actor_system(lambda c: racecontrol.run(c), cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/rally.py", line 404, in with_actor_system
        runnable(cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/rally.py", line 383, in <lambda>
        with_actor_system(lambda c: racecontrol.run(c), cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/racecontrol.py", line 390, in run
        raise e
      File "/home/local/.local/lib/python3.5/site-packages/esrally/racecontrol.py", line 387, in run
        pipeline(cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/racecontrol.py", line 61, in __call__
        self.target(cfg)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/racecontrol.py", line 334, in benchmark_only
        return race(cfg, external=True)
      File "/home/local/.local/lib/python3.5/site-packages/esrally/racecontrol.py", line 286, in race
        raise exceptions.RallyError(result.message, result.cause)
    esrally.exceptions.RallyError: ('Fatal track or load generator indication', 'Child Aborted')

Let's troubleshoot this starting simply.

Can you kill the esrallyd processes on the load driver machine (10.236.133.81) and on this machine just run this command:

esrally --pipeline=benchmark-only --track=http_logs --target-hosts=10.236.133.163:9200,10.236.133.164:9200

If this works we can take a step further to diagnose the situation with distributed load drivers and communicator node.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.