Benchmarking Remote Cluster Stalls in Preparing

toneyalexander · October 18, 2019, 4:15pm

I was following the instructions here https://esrally.readthedocs.io/en/stable/recipes.html to benchmark a remote cluster. I have 2 EC2 instances with everything set up to run rally. The esrallyd commands run without any errors. When I run

esrally --track=metricbeat --report-format=csv --report-file=~/result.csv --target-hosts=10.0.1.51:9200,10.0.1.9:9200 --distribution-version=7.4.0 &

on the coordinator (.51), Rally gets stuck at "[INFO] Preparing for race ..." and I have left it to run at that for much longer than that test has taken on just a single node. When I look at rally.log on the coordinator the final three lines say

2019-10-18 16:02:54,665 ActorAddr-(T|:33065)/PID:5992 esrally.actor INFO Checking capabilities [{'coordinator': True, 'ip': '10.0.1.51', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414555518'}] against requirements [{'ip': '10.0.1.9'}] failed.

2019-10-18 16:02:54,666 ActorAddr-(T|:1900)/PID:5852 esrally.actor INFO Checking capabilities [{'coordinator': True, 'ip': '10.0.1.51', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414555518'}] against requirements [{'ip': '10.0.1.9'}] failed.

2019-10-18 16:02:54,666 ActorAddr-(T|:1900)/PID:5852 esrally.actor INFO Capabilities [{'coordinator': False, 'ip': '10.0.1.9', 'Convention Address.IPv4': '10.0.1.51:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571414562227'}] match requirements [{'ip': '10.0.1.9'}].

I looked at Benchmarking a remote cluster - provisioning does not appear to be working? but my ports are all open, and my errors don't match theirs.

What is rally doing, and how can I figure out what is taking so long?

Update:
I changed the rally command to
esrally --track=metricbeat --report-format=csv --report-file=~/result.csv --target-hosts=10.0.1.9:9200 --distribution-version=7.4.0

and it is no longer hanging, but I get the following error message:

[ERROR] Cannot race. ('pid file not available after 60 seconds!', None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/mechanic.py", line 590, in receiveMsg_StartNodes
nodes = self.mechanic.start_engine()
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/mechanic.py", line 725, in start_engine
self.nodes = self.launcher.start(node_configs)
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 299, in start
return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 299, in
return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 328, in _start_node
node_pid = self._start_process(binary_path, env)
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 373, in _start_process
return wait_for_pidfile("./pid")
File "/usr/local/lib/python3.6/site-packages/esrally/mechanic/launcher.py", line 274, in wait_for_pidfile
raise exceptions.LaunchError(msg)
esrally.exceptions.LaunchError: ('pid file not available after 60 seconds!', None)

Christian_Dahlqvist · October 19, 2019, 8:12am

If you already have a cluster up and running, should you not user the--benchmark-only flag?

toneyalexander · October 21, 2019, 2:09pm

I have the instances that will be in the cluster running, but I don't have elasticsearch installed on any of the machines - I want rally to do the provisioning so that it can gather metrics like CPU usage.

dliappis · October 21, 2019, 3:11pm

Hello,

Seems there's some issue starting Elasticsearch on the target nodes.
Can you check the Rally log file on the target nodes and paste back here (ensuring no sensitive data is shown)?

Dimitris

toneyalexander · October 21, 2019, 3:23pm

Sure thing. I've changed out the machines and added a node, so the IP addresses are different, but I'm still getting the same error messages.

2019-10-21 14:57:13,966 -not-actor-/PID:5810 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': False, 'ip': '10.0.1.242', 'Convention Address.IPv4': '10.0.1.107:1900'}].
2019-10-21 14:57:14,8 -not-actor-/PID:5815 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
2019-10-21 15:07:20,656 ActorAddr-(T|:1900)/PID:5815 esrally.actor INFO Capabilities [{'coordinator': False, 'ip': '10.0.1.242', 'Convention Address.IPv4': '10.0.1.107:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571669834000'}] match requirements [{'ip': '10.0.1.242'}].
2019-10-21 15:07:20,656 ActorAddr-(T|:1900)/PID:5815 esrally.actor INFO Capabilities [{'coordinator': False, 'ip': '10.0.1.242', 'Convention Address.IPv4': '10.0.1.107:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1571669834000'}] match requirements [{'ip': '10.0.1.242'}].
2019-10-21 15:07:20,666 ActorAddr-(T|:40713)/PID:31558 esrally.actor INFO Starting node(s) [0] on [10.0.1.242].
2019-10-21 15:07:20,668 ActorAddr-(T|:40713)/PID:31558 esrally.metrics INFO Opening metrics store for trial timestamp=[20191021T145744Z], track=[metricbeat], challenge=[append-no-conflicts], car=[['defaults']]
2019-10-21 15:07:20,899 ActorAddr-(T|:40713)/PID:31558 esrally.utils.repo INFO Checking out revision [0a0a7bc] in [/home/ec2-user/.rally/benchmarks/teams/default].
2019-10-21 15:07:20,939 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO b'HEAD is now at 0a0a7bc... Revert "Parameterize OS name + CPU arch in download URL. (#32)"\n'
2019-10-21 15:07:20,942 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.java_resolver INFO Allowed JDK versions are [12, 11, 8].
2019-10-21 15:07:21,382 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.java_resolver INFO Detected JDK with major version [8] in [/usr/java/latest].
2019-10-21 15:07:21,382 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.supplier INFO Resolved download URL [https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.4.0-linux-x86_64.tar.gz] for version [7.4.0]
2019-10-21 15:07:21,383 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.supplier INFO Skipping download for version [7.4.0]. Found an existing binary at [/home/ec2-user/.rally/benchmarks/distributions/elasticsearch-oss-7.4.0-linux-x86_64.tar.gz].
2019-10-21 15:07:21,384 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Rally will delete the benchmark candidate after the benchmark
2019-10-21 15:07:21,384 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Preparing candidate locally in [/home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install].
2019-10-21 15:07:21,386 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Unzipping /home/ec2-user/.rally/benchmarks/distributions/elasticsearch-oss-7.4.0-linux-x86_64.tar.gz to /home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install
2019-10-21 15:07:25,256 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Deleting pre-bundled Elasticsearch configuration at [/home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install/elasticsearch-7.4.0/config]
2019-10-21 15:07:25,257 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Reading config template file [/home/ec2-user/.rally/benchmarks/teams/default/cars/v1/vanilla/templates/config/log4j2.properties] and writing to [/home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install/elasticsearch-7.4.0/config/log4j2.properties].
2019-10-21 15:07:25,265 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Reading config template file [/home/ec2-user/.rally/benchmarks/teams/default/cars/v1/vanilla/templates/config/jvm.options] and writing to [/home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install/elasticsearch-7.4.0/config/jvm.options].
2019-10-21 15:07:25,271 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.provisioner INFO Reading config template file [/home/ec2-user/.rally/benchmarks/teams/default/cars/v1/vanilla/templates/config/elasticsearch.yml] and writing to [/home/ec2-user/.rally/benchmarks/races/e2357f16-4fc8-4527-b26a-1a03dfeef385/rally-node-0/install/elasticsearch-7.4.0/config/elasticsearch.yml].
2019-10-21 15:07:25,278 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO Killing all processes which match [java], [elasticsearch] and [/home/ec2-user/.rally/benchmarks/races]
2019-10-21 15:07:25,287 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.java_resolver INFO Allowed JDK versions are [12, 11, 8].
2019-10-21 15:07:25,476 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.java_resolver INFO Detected JDK with major version [8] in [/usr/java/latest].
2019-10-21 15:07:25,476 ActorAddr-(T|:40713)/PID:31558 esrally.mechanic.launcher INFO Starting node [rally-node-0] based on car [defaults].
2019-10-21 15:07:25,646 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO b'future versions of Elasticsearch will require Java 11; your Java version from [/usr/java/jdk1.8.0_141/jre] does not meet this requirement\n'
2019-10-21 15:07:33,879 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO b'ERROR: [2] bootstrap checks failed\n'
2019-10-21 15:07:33,880 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO b'[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65535]\n'
2019-10-21 15:07:33,880 ActorAddr-(T|:40713)/PID:31558 esrally.utils.process INFO b'[2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]\n'
2019-10-21 15:08:34,50 ActorAddr-(T|:40713)/PID:31558 root ERROR pid file not available after 60 seconds!
2019-10-21 15:08:34,50 ActorAddr-(T|:40713)/PID:31558 esrally.actor ERROR Cannot process message [<esrally.mechanic.mechanic.StartNodes object at 0x7fe7595d3198>]
Traceback (most recent call last):
  [This part is the same as the error message at the end of my original post]

The last lines seem to be where the main issue is.

dliappis · October 21, 2019, 4:31pm

That's an easy one: Elasticsearch can't start as the system configuration doesn't meet the requirements described in: https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html

You'll need to increase the file descriptor limit and the vm.max_map_count settings on the target machines as described in the above link.

toneyalexander · October 21, 2019, 7:03pm

Great thanks for the help!

For anyone facing this issue in the future, the exact steps I took were as follows:

First I went into /etc/sysctl.conf and added the following lines:

vm.max_map_count = 262144
fs.file-max = 65535

(I'm not sure if that second one is necessary - I did some changes before this response so I'm not sure what is necessary)

Then I went into /etc/security/limits.conf and added

* hard nofile 65535

After I rebooted all the machines everything worked

system · November 18, 2019, 7:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Benchmarking a remote cluster - provisioning does not appear to be working? Elasticsearch rally	10	1766	February 6, 2018
Cluster provision and benchmark stalls using Rally Elasticsearch rally	21	971	April 14, 2020
Remote Cluster Benchmark test run fails to start Elasticsearch rally	4	1724	January 16, 2019
Rally stalls when I try to benchmark Elasticsearch cluster Elasticsearch docker , rally	4	659	September 7, 2020
Benchmarking an existing cluster attempt ending with a crash Elasticsearch rally	4	575	June 21, 2018

Benchmarking Remote Cluster Stalls in Preparing

Related topics