I was following the instructions at https://esrally.readthedocs.io/en/stable/recipes.html to benchmark a remote cluster. I have two EC2 instances, both set up to run Rally. The esrallyd commands run without any errors. When I run
esrally --track=metricbeat --report-format=csv --report-file=~/result.csv --target-hosts=10.0.1.51:9200,10.0.1.9:9200 --distribution-version=7.4.0 &
on the coordinator (10.0.1.51), Rally gets stuck at "[INFO] Preparing for race ..." and I have left it running in that state for much longer than the same test took on a single node. The last three lines of rally.log on the coordinator are:
2019-10-18 16:02:54,665 ActorAddr-(T|:33065)/PID:5992 esrally.actor INFO Checking capabilities [{'coordinator': True, 'ip': '10.0.1.51', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414555518'}] against requirements [{'ip': '10.0.1.9'}] failed.
2019-10-18 16:02:54,666 ActorAddr-(T|:1900)/PID:5852 esrally.actor INFO Checking capabilities [{'coordinator': True, 'ip': '10.0.1.51', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414555518'}] against requirements [{'ip': '10.0.1.9'}] failed.
2019-10-18 16:02:54,666 ActorAddr-(T|:1900)/PID:5852 esrally.actor INFO Capabilities [{'coordinator': False, 'ip': '10.0.1.9', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414562227'}] match requirements [{'ip': '10.0.1.9'}].
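To make these capability-check lines easier to compare, here is a minimal Python sketch that pulls out the fields that matter (the actor system's ip, whether it is the coordinator, and whether the requirement matched). It assumes the log lines have exactly the shape shown above; the function name `summarize` and the regular expression are my own and not part of Rally.

```python
import ast
import re

# Matches the two line shapes seen in rally.log above:
#   "Checking capabilities [{...}] against requirements [{...}] failed."
#   "Capabilities [{...}] match requirements [{...}]."
LINE_RE = re.compile(
    r"(?:Checking capabilities|Capabilities) \[(?P<caps>\{.*\})\] "
    r"(?:against|match) requirements \[(?P<reqs>\{.*\})\](?P<failed> failed)?\.$"
)


def summarize(line):
    """Return a small dict describing one capability check, or None if the line is unrelated."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    # The dicts are printed as Python literals, so literal_eval can read them safely.
    caps = ast.literal_eval(m.group("caps"))
    reqs = ast.literal_eval(m.group("reqs"))
    return {
        "ip": caps.get("ip"),
        "coordinator": caps.get("coordinator"),
        "convention_address": caps.get("Convention Address.IPv4"),
        "required": reqs,
        "matched": m.group("failed") is None,
    }
```

Applied to the three lines above, this would report two failed checks for the actor system on 10.0.1.51 and one successful match for the one on 10.0.1.9, which suggests the second host did register with the coordinator.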
I looked at the thread "Benchmarking a remote cluster - provisioning does not appear to be working?", but my ports are all open and my errors don't match theirs.
What is Rally doing, and how can I figure out what is taking so long?
Update:
I changed the Rally command to
esrally --track=metricbeat --report-format=csv --report-file=~/result.csv --target-hosts=10.0.1.9:9200 --distribution-version=7.4.0
and it is no longer hanging, but I get the following error message:
[ERROR] Cannot race. ('pid file not available after 60 seconds!', None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/mechanic.py", line 590, in receiveMsg_StartNodes
    nodes = self.mechanic.start_engine()
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/mechanic.py", line 725, in start_engine
    self.nodes = self.launcher.start(node_configs)
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 299, in start
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 299, in <listcomp>
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 328, in _start_node
    node_pid = self._start_process(binary_path, env)
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 373, in _start_process
    return wait_for_pidfile("./pid")
  File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 274, in wait_for_pidfile
    raise exceptions.LaunchError(msg)
esrally.exceptions.LaunchError: ('pid file not available after 60 seconds!', None)
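Reading the traceback, Rally appears to start the Elasticsearch process and then wait for a pid file ("./pid") to show up, giving up after 60 seconds. Below is a minimal sketch of that kind of wait loop, reconstructed from the traceback alone and not copied from Rally's source. The signature, the polling interval, and the local `LaunchError` class are assumptions made for illustration.

```python
import time


class LaunchError(Exception):
    """Hypothetical stand-in for esrally.exceptions.LaunchError."""


def wait_for_pidfile(pidfilename, timeout=60, poll_interval=0.5):
    """Poll until the pid file exists and holds a pid, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(pidfilename) as f:
                content = f.read().strip()
            # The file may exist but still be empty while the process is starting.
            if content:
                return int(content)
        except FileNotFoundError:
            # Not written yet; keep waiting.
            pass
        time.sleep(poll_interval)
    raise LaunchError("pid file not available after {} seconds!".format(timeout))
```

If a loop like this times out, the launched process most likely exited or stalled before it wrote its pid, so the startup output of the Elasticsearch node itself is probably the next place to look.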