I am following the docs to be able to benchmark a remote cluster as per: Tips and Tricks - Rally 2.10.0 documentation. It is my understanding that this should be able to create the ES node on the benchmark candidate node. In the docs it states:
Contrary to the previous recipe, you want Rally to provision all cluster nodes
I have two EC2 instances that I have created: 10.0.0.223 (rally coordinating node) and 10.0.0.131 (rally benchmark candidate node). I have installed and configured esrally on both systems, and both have Java 8 installed. I started esrallyd on the coordinating node (10.0.0.223) and then started esrallyd on the benchmark candidate node (10.0.0.131). The status of esrallyd is reported as started on both nodes.
I attempt to run the following and it downloads the tracks into the ~/.rally/benchmarks/tracks but freezes up after that:
esrally --distribution-version=6.0.0 --target-hosts=10.0.0.131:9200
Is there anything I am doing that is incorrect or a way to be able to know if the daemon is provisioning the remote node?
These are the last few log entries from the coordinating node and it just seems to stop:
2018-01-08 21:58:31,655 PID:9837 rally.racecontrol INFO ActorAddr-(T|:40586) => Asking mechanic to start the engine.
2018-01-08 21:58:31,659 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Received signal from race control to start engine.
2018-01-08 21:58:31,660 PID:9838 rally.metrics INFO ActorAddr-(T|:46010) => Opening metrics store for invocation=[20180108T215831Z], track=[percolator], challenge=[append-no-conflicts], car=[['defaults']]
2018-01-08 21:58:31,660 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Cluster consisting of [{'host': '10.0.0.131'}] will be provisioned by Rally.
2018-01-08 21:58:31,660 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Benchmarking against [{'host': '10.0.0.131'}] with external Rally daemon.
2018-01-08 21:58:31,662 PID:9838 rally.mechanic INFO ActorAddr-(T|:46010) => Actor system on [10.0.0.131] already running? [True]
Upon first glance, what you did looks fine; your expectation (that Rally creates and starts the ES nodes) is also correct.
Did you ensure that both machines (10.0.0.223 and 10.0.0.131) can talk to each other? One way to check whether both nodes really see each other is to look into ~/.rally/logs/actor-system-internal.log. On the coordinating node, you should see a line similar to:
Got Convention registration from ActorAddr-(T|10.0.0.131:1900) (re-registering) (new? False)
On the target machine (10.0.0.131) you should ensure that the machine can connect to the Internet. You can also inspect ~/.rally/logs/rally-actor-messages.log to see whether any errors have occurred (it should report them back to the coordinating node though and not just hang).
I just spun up some new EC2 instances and started from scratch. esrallyd is started on each of the nodes (taking care to ensure that esrallyd was started on the coordinator first). I confirmed that port 1900 is open and I can telnet to that port. I assume that esrallyd is using TCP for communication.
Coordinator node:
ubuntu@ip-10-0-0-99:~$ esrallyd start --node-ip=10.0.0.99 --coordinator-ip=10.0.0.99
[INFO] Successfully started actor system on node [10.0.0.99] with coordinator node IP [10.0.0.99]
Candidate node:
ubuntu@ip-10-0-0-22:~$ esrallyd start --node-ip=10.0.0.22 --coordinator-ip=10.0.0.99
[INFO] Successfully started actor system on node [10.0.0.22] with coordinator node IP [10.0.0.99]
There are 2 log files on both nodes: rally-actor-messages.log and rally-actors.log. The former has a single line and the latter is empty on both machines. actor-system-internal.log does not exist on either.
It looks like it is stalled at whatever the next step is after cloning the tracks. FWIW, I was able to do a git clone of the teams repo manually without an issue.
This is correct. Rally uses port 1900 for internal communication.
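As an alternative to telnet, the reachability of port 1900 can be checked from either machine with a few lines of Python. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper name, and the IPs in the usage comment are the ones from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage (run on the candidate node to test the coordinator):
#   can_connect("10.0.0.99", 1900)
```

Note that a successful check on port 1900 only shows that the two daemons can reach each other; it says nothing about other ports.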
The interesting log file for you is rally-actor-messages.log. As Rally is progressing, it should contain what it is doing on that node. I forgot that actor-system-internal.log is only available in development, my bad. However, there should be a file /tmp/thespian.log which serves the same purpose (in development Rally just redefines the name of this file to ~/.rally/logs/actor-system-internal.log). So you can check whether /tmp/thespian.log on 10.0.0.99 contains something like:
Got Convention registration from ActorAddr-(T|10.0.0.22:1900) (re-registering) (new? False)
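To avoid scanning the log by eye, the registration lines can be extracted with a small script. This is only a sketch that assumes the line format shown above; `registered_nodes` is a hypothetical helper, not part of Rally or Thespian.

```python
import re

# Matches e.g. "Got Convention registration from ActorAddr-(T|10.0.0.22:1900)"
REGISTRATION = re.compile(
    r"Got Convention registration from "
    r"ActorAddr-\(T\|(?P<host>[\d.]+):(?P<port>\d+)\)"
)


def registered_nodes(log_lines):
    """Return the set of (host, port) pairs that registered with this node."""
    nodes = set()
    for line in log_lines:
        match = REGISTRATION.search(line)
        if match:
            nodes.add((match.group("host"), int(match.group("port"))))
    return nodes


# Example usage on the coordinating node:
#   with open("/tmp/thespian.log") as f:
#       print(registered_nodes(f))
```

If the candidate node's IP is missing from the result on the coordinator, the two actor systems have not joined each other.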
ubuntu@ip-10-0-0-99:~$ tail -f /tmp/thespian.log
2018-01-09 15:51:28.400624 p9388 I Got Convention registration from ActorAddr-(T|10.0.0.22:1900) (first time) (new? True)
2018-01-09 15:52:01.677400 p9388 I Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-(T|:46125)
2018-01-09 15:52:01.681994 p9496 I Starting Actor esrally.racecontrol.BenchmarkActor at ActorAddr-(T|:34783) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2018-01-09 15:52:01.691929 p9497 I Starting Actor <class 'esrally.mechanic.mechanic.MechanicActor'> at ActorAddr-(T|:37304) (parent ActorAddr-(T|:34783), admin ActorAddr-(T|:1900))
2018-01-09 15:52:02.085309 p9388 I Pending Actor request received for esrally.mechanic.mechanic.NodeMechanicActor reqs {'ip': '10.0.0.22'} from ActorAddr-(T|:37304)
2018-01-09 15:52:02.085757 p9388 I Requesting creation of esrally.mechanic.mechanic.NodeMechanicActor on remote admin ActorAddr-(T|10.0.0.22:1900)
Benchmark Candidate node:
2018-01-09 15:51:28.391259 p7140 I ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 8)
2018-01-09 15:51:28.398988 p7140 I Admin registering with Convention @ ActorAddr-(T|10.0.0.99:1900) (first time)
2018-01-09 15:51:28.399530 p7140 I Setting log aggregator of ActorAddr-(T|:42718) to ActorAddr-(T|10.0.0.99:1900)
2018-01-09 15:51:28.402735 p7140 I Got Convention registration from ActorAddr-(T|10.0.0.99:1900) (re-registering) (new? True)
2018-01-09 15:52:02.087090 p7140 I Pending Actor request received for esrally.mechanic.mechanic.NodeMechanicActor reqs {'ip': '10.0.0.22'} from ActorAddr-(T|10.0.0.99:1900)
2018-01-09 15:52:02.265427 p7151 I Starting Actor esrally.mechanic.mechanic.NodeMechanicActor at ActorAddr-(T|:42390) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2018-01-09 15:54:09.502154 p7140 ERR Socket error sending to ActorAddr-(T|10.0.0.99:37304) on <socket.socket fd=12, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.0.0.22', 38030)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|10.0.0.99:37304)-pending-ExpiresIn_0:02:52.765740-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-(T|10.0.0.99:37304) inst# 0) errCode None actual ActorAddr-(T|:42390)-quit_0:02:52.765730)
It looks like it needs to have an additional port, 38030, open to communicate back to the coordinator:
Socket error sending to ActorAddr-(T|10.0.0.99:37304) on <socket.socket fd=12, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.0.0.22', 38030)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|10.0.0.99:37304)-pending-ExpiresIn_0:02:52.765740-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-(T|10.0.0.99:37304) inst# 0) errCode None actual ActorAddr-(T|:42390)-quit_0:02:52.765730)
I remember seeing in some docs or on GitHub that some other ports might be needed. I am going to open up access to all ports for these EC2 instances and see if that fixes things.
It is not just this specific port. The actor system that Rally uses internally will start actors on arbitrary (unprivileged) ports. As you write, I'd recommend that you open unprivileged ports between the private interfaces of both machines (i.e. the 10.0.0.x network).
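To see which port a failed message was actually addressed to, the error line can be parsed. This is a minimal sketch that assumes the log format shown in this thread; `parse_socket_error` is a hypothetical helper. In the Python socket representation, laddr is the local end of the connection, so in the error above 38030 is the candidate's outgoing source port, while 37304 is the port of the actor on the coordinator that could not be reached.

```python
import re

# Captures the destination actor address and the local socket address.
SOCKET_ERROR = re.compile(
    r"Socket error sending to ActorAddr-\(T\|(?P<rhost>[\d.]*):(?P<rport>\d+)\)"
    r".*?laddr=\('(?P<lhost>[\d.]+)', (?P<lport>\d+)\)"
)


def parse_socket_error(line):
    """Return destination and local endpoints of a failed send, or None."""
    match = SOCKET_ERROR.search(line)
    if match is None:
        return None
    return {
        "remote": (match.group("rhost"), int(match.group("rport"))),
        "local": (match.group("lhost"), int(match.group("lport"))),
    }


def is_unprivileged(port):
    """Ports from 1024 upwards do not require root to bind."""
    return port >= 1024
```

Applied to the error above, the destination port is an arbitrary unprivileged port that changes between runs, which is why opening a single port is not enough.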