Benchmarking a remote cluster - provisioning does not appear to be working?

Hi

I am following the docs to be able to benchmark a remote cluster as per: Tips and Tricks - Rally 2.10.0 documentation. It is my understanding that this should be able to create the ES node on the benchmark candidate node. In the docs it states:

Contrary to the previous recipe, you want Rally to provision all cluster nodes

I have two EC2 instances that I have created: 10.0.0.223 (rally coordinating node) and 10.0.0.131 (rally benchmark candidate node). I have installed and configured esrally on both systems, they both have java 8 installed. I started the the esrallyd on the coordinating node (10.0.0.223) and then started esrallyd on the benchmark candidate node (10.0.0.131). The status of esrallyd on each node is reported as started for both.

I attempt to run the following and it downloads the tracks into the ~/.rally/benchmarks/tracks but freezes up after that:
esrally --distribution-version=6.0.0 --target-hosts=10.0.0.131:9200

Is there anything I am doing that is incorrect or a way to be able to know if the daemon is provisioning the remote node?

These are the last few log entries from the coordinating node and it just seems to stop:

2018-01-08 21:58:31,655 PID:9837 rally.racecontrol INFO ActorAddr-(T|:40586) => Asking mechanic to start the engine. 2018-01-08 21:58:31,659 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Received signal from race control to start engine. 2018-01-08 21:58:31,660 PID:9838 rally.metrics INFO ActorAddr-(T|:46010) => Opening metrics store for invocation=[20180108T215831Z], track=[percolator], challenge=[append-no-conflicts], car=[['defaults']] 2018-01-08 21:58:31,660 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Cluster consisting of [{'host': '10.0.0.131'}] will be provisioned by Rally. 2018-01-08 21:58:31,660 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Benchmarking against [{'host': '10.0.0.131'}] with external Rally daemon. 2018-01-08 21:58:31,662 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Actor system on [10.0.0.131] already running? [True]

Hi,

upon first glance it looks fine what you did; also your expectations (Does Rally create and start the ES nodes?) are correct.

Did you ensure that both machines (10.0.0.223 and 10.0.0.131) can talk to each other? One way to check whether both nodes really see each other is to look into ~/.rally/logs/actor-system-internal.log. On the coordinating node, you should see a line similar to:

Got Convention registration from ActorAddr-(T|10.0.0.131:1900) (re-registering) (new? False)

On the target machine (10.0.0.131) you should ensure that the machine can connect to the Internet. You can also inspect ~/.rally/logs/rally-actor-messages.log to see whether any error have occurred (it should report them back to the coordinating node though and not just hang).

I just spun up some new EC2 instances and started from scratch. esrallyd is started on each of the nodes (taking care to ensure that the esrallyd was started on the coordinator first). I confirmed that port 1900 is open and I can telnet to that port. I assume that the esrallyd is it using TCP for communication

Coordinator node:

ubuntu@ip-10-0-0-99:~$ esrallyd start --node-ip=10.0.0.99 --coordinator-ip=10.0.0.99
[INFO] Successfully started actor system on node [10.0.0.99] with coordinator node IP [10.0.0.99]

Candidate node:

ubuntu@ip-10-0-0-22:~$ esrallyd start --node-ip=10.0.0.22 --coordinator-ip=10.0.0.99
[INFO] Successfully started actor system on node [10.0.0.22] with coordinator node IP [10.0.0.99]

There are one 2 log files on both nodes: rally-actor-messages.log and rally-actors.log. The former has a single line and the later file is empty on both machines. actor-system-internal.log does not exist on either.

ubuntu@ip-10-0-0-99:~$ tail -f .rally/logs/* ==> .rally/logs/rally-actor-messages.log <== 2018-01-09 13:19:32,275 PID:6101 Thespian.Admin INFO ActorAddr-(T|:1900) => ActorSystem Administrator startup @ ActorAddr-(T|:1900)

==> .rally/logs/rally-actors.log <==

Running the esrally command I get the following and it just pauses after downloading the tracks into ~/.rally/benchmarks/tracks

ubuntu@ip-10-0-0-99:~$ esrally --distribution-version=6.0.0 --target-hosts=10.0.0.22:9200

[INFO] Writing logs to /home/ubuntu/.rally/logs/rally_out_20180109T134934Z.log
Cloning into '/home/ubuntu/.rally/benchmarks/tracks/default'...
remote: Counting objects: 1859, done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 1859 (delta 78), reused 59 (delta 41), pack-reused 1746
Receiving objects: 100% (1859/1859), 874.25 KiB | 0 bytes/s, done.
Resolving deltas: 100% (1303/1303), done.
Checking connectivity... done.

It looks like it is stalled with whatever the next step after cloning the tracks. FWIW, I was able to do a git clone of the teams repo manually without an issue

This is correct. Rally uses port 1900 for internal communication.

The interesting log file for you is rally-actor-messages.log. As Rally is progressing, it should contain what it is doing on that node. I forgot that actor-system-internal.log is only available in development, my bad. However, there should exist a file /tmp/thespian.log which serves the same purpose (in development Rally just redefines the name of this file to ~/.rally/logs/actor-system-internal.log. So you can check there whether you see on 10.0.0.99in/tmp/thespian.log` something like

Got Convention registration from ActorAddr-(T|10.0.0.22:1900) (re-registering) (new? False)

It should actually say a bit more in the logs than on the terminal.

The sequence is:

  1. Load the track (and as part of that clone the repo if that has not happened)
  2. Connect to the metrics store (either in-memory or a dedicated ES metrics store)
  3. Prepare and start Elasticsearch (preparation includes download of an ES distribution from the official download URL)

It is a bit involved but gdb can help you find out why (and where) something is hanging if you don't have any other clues (e.g. log file output):

  1. Install gdb and the Python debug symbols (assuming Ubuntu): sudo apt-get install -y gdb python3-dbg
  2. Find the pid to attach to (e.g. with ps -ef | grep rally)
  3. Attach to the process: gdb python3 -p THE_PID.
  4. Type py-bt in gdb. You should get a stack trace that indicates where it is currently hanging.

Coordinator node:

ubuntu@ip-10-0-0-99:~$ tail -f /tmp/thespian.log
2018-01-09 15:51:28.400624 p9388 I    Got Convention registration from ActorAddr-(T|10.0.0.22:1900) (first time) (new? True)
2018-01-09 15:52:01.677400 p9388 I    Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-(T|:46125)
2018-01-09 15:52:01.681994 p9496 I    Starting Actor esrally.racecontrol.BenchmarkActor at ActorAddr-(T|:34783) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2018-01-09 15:52:01.691929 p9497 I    Starting Actor <class 'esrally.mechanic.mechanic.MechanicActor'> at ActorAddr-(T|:37304) (parent ActorAddr-(T|:34783), admin ActorAddr-(T|:1900))
2018-01-09 15:52:02.085309 p9388 I    Pending Actor request received for esrally.mechanic.mechanic.NodeMechanicActor reqs {'ip': '10.0.0.22'} from ActorAddr-(T|:37304)
2018-01-09 15:52:02.085757 p9388 I    Requesting creation of esrally.mechanic.mechanic.NodeMechanicActor on remote admin ActorAddr-(T|10.0.0.22:1900)

Benchmark Candidate node:

2018-01-09 15:51:28.391259 p7140 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 8)
2018-01-09 15:51:28.398988 p7140 I    Admin registering with Convention @ ActorAddr-(T|10.0.0.99:1900) (first time)
2018-01-09 15:51:28.399530 p7140 I    Setting log aggregator of ActorAddr-(T|:42718) to ActorAddr-(T|10.0.0.99:1900)
2018-01-09 15:51:28.402735 p7140 I    Got Convention registration from ActorAddr-(T|10.0.0.99:1900) (re-registering) (new? True)
2018-01-09 15:52:02.087090 p7140 I    Pending Actor request received for esrally.mechanic.mechanic.NodeMechanicActor reqs {'ip': '10.0.0.22'} from ActorAddr-(T|10.0.0.99:1900)
2018-01-09 15:52:02.265427 p7151 I    Starting Actor esrally.mechanic.mechanic.NodeMechanicActor at ActorAddr-(T|:42390) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2018-01-09 15:54:09.502154 p7140 ERR  Socket error sending to ActorAddr-(T|10.0.0.99:37304) on <socket.socket fd=12, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.0.0.22', 38030)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|10.0.0.99:37304)-pending-ExpiresIn_0:02:52.765740-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-(T|10.0.0.99:37304) inst# 0) errCode None actual ActorAddr-(T|:42390)-quit_0:02:52.765730)

It looks like it needs to have an additional port, 38030 open to communicate back to the coordinator

Socket error sending to ActorAddr-(T|10.0.0.99:37304) on <socket.socket fd=12, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.0.0.22', 38030)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|10.0.0.99:37304)-pending-ExpiresIn_0:02:52.765740-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-(T|10.0.0.99:37304) inst# 0) errCode None actual ActorAddr-(T|:42390)-quit_0:02:52.765730)

I remember seeing in some docs or git that there are some other ports that might be needed. I am going to open up all access to all ports for these EC2 instances and see if that fixes things

It is not just this specific port. The actor system that Rally uses internally will start actors on arbitrary (unprivileged) ports. As you write, I'd recommend that you open unprivileged ports between the private interfaces of both machines (i.e. the 10.0.0.x network).

Opening up the ports fixed things! I just created a PR to update the docs

Glad to hear that the problem is resolved now.

The PR is already merged. Thanks! :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.