Cluster provision and benchmark stalls using Rally

I am following the cluster benchmarking procedure from the Rally 1.4 documentation. I have three virtual machines:

Coordinator is 10.156.129.118 (vm-1)
Targets are 10.156.129.121 (vm-2) and 10.156.129.169 (vm-4)

On Coordinator:

[elastic@elastic-vm-1 tmp]$ esrallyd start --node-ip=10.156.129.118 --coordinator-ip=10.156.129.118
[INFO] Successfully started actor system on node [10.156.129.118] with coordinator node IP [10.156.129.118]

On Targets:

[elastic@elastic-vm-4 ~]$ esrallyd start --node-ip=10.156.129.169 --coordinator-ip=10.156.129.118
[INFO] Successfully started actor system on node [10.156.129.169] with coordinator node IP [10.156.129.118].

[elastic@elastic-vm-2 ~]$ esrallyd start --node-ip=10.156.129.121 --coordinator-ip=10.156.129.118
[INFO] Successfully started actor system on node [10.156.129.121] with coordinator node IP [10.156.129.118].

Then I create the cluster and run the benchmark. It immediately reaches "Preparing for race ..." but stays stuck there forever:

[elastic@elastic-vm-1 tmp]$ esrally --distribution-version=7.5.2 --target-hosts=10.40.195.121:9200,10.156.129.169:9200

[INFO] Preparing for race ...

rally.log:

    2020-03-12 00:27:57,440 ActorAddr-(T|:1900)/PID:1697 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1583972540563', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.156.129.118'}] match requirements [{'coordinator': True}].
    2020-03-12 00:27:58,249 ActorAddr-(T|:35236)/PID:1752 esrally.utils.repo INFO Checking out [7] in [/home/elastic/.rally/benchmarks/tracks/default] for distribution version [7.5.2].
    2020-03-12 00:27:58,260 ActorAddr-(T|:35236)/PID:1752 esrally.utils.process INFO Already on '7'
    Your branch is up-to-date with 'origin/7'.

    2020-03-12 00:27:58,260 ActorAddr-(T|:35236)/PID:1752 esrally.utils.repo INFO Rebasing on [7] in [/home/elastic/.rally/benchmarks/tracks/default] for distribution version [7.5.2].
    2020-03-12 00:27:58,269 ActorAddr-(T|:35236)/PID:1752 esrally.utils.process INFO Already on '7'
    Your branch is up-to-date with 'origin/7'.

    2020-03-12 00:27:58,314 ActorAddr-(T|:35236)/PID:1752 esrally.utils.process INFO Current branch 7 is up to date.

    2020-03-12 00:27:58,319 ActorAddr-(T|:35236)/PID:1752 esrally.track.loader INFO Reading track specification file [/home/elastic/.rally/benchmarks/tracks/default/geonames/track.json].
    2020-03-12 00:27:58,353 ActorAddr-(T|:35236)/PID:1752 esrally.track.loader INFO Final rendered track for '/home/elastic/.rally/benchmarks/tracks/default/geonames/track.json' has been written to '/tmp/tmpl3yggaos.json'.
    2020-03-12 00:27:58,361 ActorAddr-(T|:35236)/PID:1752 esrally.track.loader INFO Loading template [definition for index geonames in index.json].
    2020-03-12 00:27:58,367 ActorAddr-(T|:35236)/PID:1752 esrally.metrics INFO Creating in-memory metrics store
    2020-03-12 00:27:58,367 ActorAddr-(T|:35236)/PID:1752 esrally.metrics INFO Opening metrics store for race timestamp=[20200312T002756Z], track=[geonames], challenge=[append-no-conflicts], car=[['defaults']]
    2020-03-12 00:27:58,368 ActorAddr-(T|:35236)/PID:1752 esrally.metrics INFO Creating file race store
    2020-03-12 00:27:58,368 ActorAddr-(T|:35236)/PID:1752 esrally.actor INFO Asking mechanic to start the engine.
    2020-03-12 00:27:58,368 ActorAddr-(T|:35236)/PID:1752 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1583972540563', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.156.129.118'}] match requirements [{'coordinator': True}].
    2020-03-12 00:27:58,378 ActorAddr-(T|:32810)/PID:1816 esrally.actor INFO Received signal from race control to start engine.
    2020-03-12 00:27:58,962 ActorAddr-(T|:32810)/PID:1816 esrally.utils.repo INFO Checking out [7] in [/home/elastic/.rally/benchmarks/teams/default] for distribution version [7.5.2].
    2020-03-12 00:27:58,973 ActorAddr-(T|:32810)/PID:1816 esrally.utils.process INFO Already on '7'
    Your branch is up-to-date with 'origin/7'.

    2020-03-12 00:27:58,973 ActorAddr-(T|:32810)/PID:1816 esrally.utils.repo INFO Rebasing on [7] in [/home/elastic/.rally/benchmarks/teams/default] for distribution version [7.5.2].
    2020-03-12 00:27:58,984 ActorAddr-(T|:32810)/PID:1816 esrally.utils.process INFO Already on '7'
    Your branch is up-to-date with 'origin/7'.

    2020-03-12 00:27:59,21 ActorAddr-(T|:32810)/PID:1816 esrally.utils.process INFO Current branch 7 is up to date.

    2020-03-12 00:27:59,30 ActorAddr-(T|:32810)/PID:1816 esrally.actor INFO Cluster consisting of [{'host': '10.40.195.121', 'port': 9200}, {'host': '10.156.129.169', 'port': 9200}] will be provisioned by Rally.
    2020-03-12 00:27:59,31 ActorAddr-(T|:32810)/PID:1816 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1583972540563', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.156.129.118'}] match requirements [{}].
    2020-03-12 00:27:59,41 ActorAddr-(T|:41796)/PID:1880 esrally.actor INFO Remote Rally node [10.156.129.169] has started.
    2020-03-12 00:27:59,43 ActorAddr-(T|:1900)/PID:1697 esrally.actor INFO Checking capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1583972540563', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.156.129.118'}] against requirements [{'ip': '10.156.129.169'}] failed.
    2020-03-12 00:27:59,41 ActorAddr-(T|:41796)/PID:1880 esrally.actor INFO Checking capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1583972540563', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '10.156.129.118'}] against requirements [{'ip': '10.156.129.169'}] failed.
    2020-03-12 00:27:59,43 ActorAddr-(T|:1900)/PID:1697 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'ip': '10.156.129.169', 'coordinator': False, 'Thespian Version': '1583972574587', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Python Version': (3, 5, 1, 'final', 0)}] match requirements [{'ip': '10.156.129.169'}].
  1. This was my second attempt at starting esrallyd and running the benchmark test. I do not see one of the targets, 10.156.129.121, under "|Convention Attendees [1]:" even though the esrallyd command was successful. See the thespian shell status output below.
    I attempted to stop and start esrallyd across these servers but still see the same issue. Why is this happening?

  2. In my first attempt at esrallyd start, even though both targets (10.156.129.121 and 10.156.129.169) were showing under "Convention Attendees", the benchmark still stalled, with a similar message ("esrally.actor INFO Remote Rally node [10.156.129.121] has started") in rally.log. Why?

On the Linux servers the firewall is disabled, so there is no port blocking issue, and the internet is accessible from all of these VMs.
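
For reference, this is roughly how I checked it on each VM (a sketch; the exact firewall service name depends on the distribution):

    sudo systemctl is-active firewalld        # expect "inactive"/"unknown" since the firewall is disabled
    sudo iptables -L -n | head                # chains should be empty with ACCEPT policies
    curl -sI https://github.com | head -n 1   # confirms outbound internet access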

Any pointers to troubleshoot the above two issues are appreciated.

[root@elastic-vm-1 ~]# tail -f /tmp/thespian.log

2020-03-11 17:22:36.816157 p1697 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.
2020-03-11 17:32:43.176405 p1697 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.

Thespian status
**********************

thespian> status
Requesting status from Actor (or Admin) @ ActorAddr-(T|:1900) (#0)
Status of ActorSystem @ ActorAddr-(T|:1900) [#0]:
  |Capabilities[9]:
     Thespian ActorSystem Version: 2
                   Python Version: (3, 5, 1, 'final', 0)
              Thespian Generation: (3, 9)
                               ip: 10.156.129.118
         Thespian Watch Supported: True
        Thespian ActorSystem Name: multiprocTCPBase
          Convention Address.IPv4: 10.156.129.118:1900
                 Thespian Version: 1583972540563
                      coordinator: True
  |Convention Leader: ActorAddr-(T|:1900) [#0]
  |Convention Notifications [1]:
    ActorAddr-(T|:41796) [#2]
  |Convention Attendees [1]:
    @ ActorAddr-(T|10.156.129.169:1900) [#1]: Expires_in_0:20:21.828543
  |Primary Actors [1]:
    @ ActorAddr-(T|:35236) [#3]
  |Rate Governer: Rate limit: 4480 messages/sec (currently low with 9 ticks)
  |Pending Messages [0]:
  |Received Messages [0]:
  |Pending Wakeups [0]:
  |Pending Address Resolution [0]:
  |>           9 - Actor.Message Send.Transmit Started
  |>           4 - Admin Handle Convention Registration
  |>          15 - Admin Message Received.Total
  |>           1 - Admin Message Received.Type.Pending Actor Request
  |>           2 - Admin Message Received.Type.QueryExists
  |>           3 - Admin Message Received.Type.StatusReq
  |> sock#2-fd10 - Idle-socket <socket.socket fd=10, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.118', 35194)>->ActorAddr-(T|:35878) (Expires_in_0:12:36.158960)
  |> sock#3-fd11 - Idle-socket <socket.socket fd=11, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.121', 52192)>->ActorAddr-(T|:1900) (Expires_in_0:17:20.289688)
  |> sock#1-fd12 - Idle-socket <socket.socket fd=12, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.169', 47848)>->ActorAddr-(T|10.156.129.169:1900) (Expires_in_0:18:15.830708)
  |> sock#6-fd13 - Idle-socket <socket.socket fd=13, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.118', 35208)>->ActorAddr-(T|:41962) (Expires_in_0:12:34.468386)
  |> sock#4-fd15 - Idle-socket <socket.socket fd=15, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.118', 35282)>->ActorAddr-(T|:41796) (Expires_in_0:12:36.156473)
  |> sock#0-fd17 - Idle-socket <socket.socket fd=17, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.156.129.118', 1900), raddr=('10.156.129.118', 35218)>->ActorAddr-(T|:35236) (Expires_in_0:12:34.560533)
  |> sock#5-fd18 - Idle-socket <socket.socket fd=18, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 1900), raddr=('127.0.0.1', 39644)>->ActorAddr-(T|:37030) (Expires_in_0:19:59.999589)
  |DeadLetter Addresses [0]:
  |Source Authority: None
  |Loaded Sources [0]:
  |Global Actors [0]:

@palani FYI I reformatted your post for better readability.


@dliappis, yeah, I tried my best to format it properly but it looks like I missed a few lines. Now it is more readable, thank you :slight_smile:

@ven100

Can you paste the remaining contents of ~/.rally/logs/rally.log from the coordinator, after the INFO Capabilities line?

There should be something like INFO Mechanic has started engine successfully. followed with, hopefully, information about the nature of the problem.
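
As a quick check (a sketch; adjust the path if your logs live elsewhere), you can search for that line directly:

    grep -n "Mechanic has started engine successfully" ~/.rally/logs/rally.log
    tail -n 50 ~/.rally/logs/rally.log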

@dliappis There are no more logs; that is the last line in rally.log and it stalls there... I had to press Ctrl+C to kill it.

[elastic@elastic-vm-1 tmp]$ esrally --distribution-version=7.5.2 --target-hosts=10.40.195.121:9200,10.156.129.169:9200

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Preparing for race ...

@palani

Can you add debug logging? Just change "level": "INFO" to "level": "DEBUG" in your ~/.rally/logging.json, rerun, and share ~/.rally/logs/rally.log.
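
If it helps, a one-liner along these lines should flip all the levels at once (a sketch; back up the file first, since the exact structure of logging.json can differ between Rally versions):

    cp ~/.rally/logging.json ~/.rally/logging.json.bak
    sed -i 's/"level": "INFO"/"level": "DEBUG"/g' ~/.rally/logging.json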

One other test, to verify connectivity:
Ensure all rally/rallyd processes are killed on all VMs.

Then on the target VMs vm-2 and vm-4 run:

nc -k -v -l 1900

On the coordinator vm-1 run:

printf "message from coordinator to vm-2\n" | nc 10.156.129.121 1900
printf "message from coordinator to vn-4\n" | nc 10.156.129.169 1900

You should see both messages from the coordinator on the target nodes.

@ven100

I noticed you are using port 9200; Rally uses port 39200 by default (see https://esrally.readthedocs.io/en/stable/configuration.html#configuration).

So try changing the port in your --target-hosts.

I can see where the confusion comes from: in the docs (https://esrally.readthedocs.io/en/stable/recipes.html#distributing-the-load-test-driver) the port mentioned is 9200. I'll address this in a PR.

(Default ports can be changed via --car-params and the property http_port.)
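
Concretely, either of the following should work; this is a sketch based on your earlier command, with the same IPs you used:

    # point --target-hosts at Rally's default port 39200
    esrally --distribution-version=7.5.2 --target-hosts=10.40.195.121:39200,10.156.129.169:39200

    # or keep 9200 by asking Rally to provision Elasticsearch on that port
    esrally --distribution-version=7.5.2 --target-hosts=10.40.195.121:9200,10.156.129.169:9200 --car-params="http_port:9200"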

@dliappis I tried your suggestions: added debug logging and changed the port, but the behavior is the same.

Connectivity check:
[elastic@elastic-vm-1 tmp] printf "message from coordinator to vm-2\n" | nc 10.156.129.121 1900
[elastic@elastic-vm-1 tmp] printf "message from coordinator to vn-4\n" | nc 10.156.129.169 1900

[elastic@elastic-vm-2 ~]$ nc -k -v -l 1900
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Listening on :::1900
Ncat: Listening on 0.0.0.0:1900
Ncat: Connection from 10.156.129.118.
Ncat: Connection from 10.156.129.118:39738.
message from coordinator to vm-2


[elastic@elastic-vm-4 ~]$ nc -k -v -l 1900
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Listening on :::1900
Ncat: Listening on 0.0.0.0:1900
Ncat: Connection from 10.156.129.118.
Ncat: Connection from 10.156.129.118:51932.
message from coordinator to vn-4

rally.log part 1 (I see there is a limit on the amount of text in a single post, so part 2 of the log will follow in the next post):

2020-03-13 18:23:38,833 -not-actor-/PID:4205 esrally.rally INFO OS [uname_result(system='Linux', node='elastic-vm-1', release='3.10.0-514.el7.x86_64', version='#1 SMP Tue Nov 22 16:42:41 UTC 2016', machine='x86_64', processor='x86_64')]
2020-03-13 18:23:38,833 -not-actor-/PID:4205 esrally.rally INFO Python [namespace(cache_tag='cpython-35', hexversion=50659824, name='cpython', version=sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0))]
2020-03-13 18:23:38,833 -not-actor-/PID:4205 esrally.rally INFO Rally version [1.4.0]
2020-03-13 18:23:38,833 -not-actor-/PID:4205 esrally.rally DEBUG Command line arguments: Namespace(advanced_config=False, assume_defaults=False, car='defaults', car_params='', challenge=None, client_options='timeout:60', configuration_name=None, distribution_repository='release', distribution_version='7.5.2', effective_start_date=None, elasticsearch_plugins='', enable_driver_profiling=False, exclude_tasks=None, include_tasks=None, keep_cluster_running=False, load_driver_hosts='localhost', offline=False, on_error='continue', pipeline='', plugin_params='', preserve_install='False', quiet=False, race_id='c0eccacf-d218-4c84-a798-0a91af8c6b38', report_file='', report_format='markdown', revision='current', runtime_jdk=None, show_in_report='available', skip_rest_api_check=False, subcommand=None, target_hosts='10.40.195.121:39200,10.156.129.169:39200', team_path=None, team_repository='default', team_revision=None, telemetry='', telemetry_params='', test_mode=False, track=None, track_params='', track_path=None, track_repository='default', track_revision=None, user_tag='')
2020-03-13 18:23:38,834 -not-actor-/PID:4205 esrally.utils.net INFO Connecting directly to the Internet (no proxy support).
2020-03-13 18:23:38,834 -not-actor-/PID:4205 esrally.utils.net DEBUG Checking for internet connection against [https://github.com/]
2020-03-13 18:23:39,182 -not-actor-/PID:4205 esrally.utils.net DEBUG Probing result is HTTP status [200]
2020-03-13 18:23:39,182 -not-actor-/PID:4205 esrally.rally INFO Detected a working Internet connection.
2020-03-13 18:23:39,208 -not-actor-/PID:4205 esrally.rally INFO Actor system already running locally? [True]
2020-03-13 18:23:39,208 -not-actor-/PID:4205 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-13 18:23:39,225 -not-actor-/PID:4205 esrally.racecontrol INFO User specified no pipeline. Automatically derived pipeline [from-distribution].
2020-03-13 18:23:39,226 -not-actor-/PID:4205 esrally.racecontrol INFO Using configured hosts [{'host': '10.40.195.121', 'port': 39200}, {'host': '10.156.129.169', 'port': 39200}]
2020-03-13 18:23:39,226 -not-actor-/PID:4205 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-13 18:23:39,305 ActorAddr-(T|:1900)/PID:4203 esrally.driver.scheduler DEBUG Registering object [<class 'esrally.driver.scheduler.DeterministicScheduler'>] for [deterministic].
2020-03-13 18:23:39,305 ActorAddr-(T|:1900)/PID:4203 esrally.driver.scheduler DEBUG Registering object [<class 'esrally.driver.scheduler.PoissonScheduler'>] for [poisson].
2020-03-13 18:23:39,310 ActorAddr-(T|:1900)/PID:4203 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Python Version': (3, 5, 1, 'final', 0), 'ip': '10.156.129.118', 'Thespian Version': '1584123752039', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True}] match requirements [{'coordinator': True}].
2020-03-13 18:23:39,310 ActorAddr-(T|:1900)/PID:4203 esrally.actor INFO Capabilities [{'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'coordinator': True, 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Python Version': (3, 5, 1, 'final', 0), 'ip': '10.156.129.118', 'Thespian Version': '1584123752039', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True}] match requirements [{'coordinator': True}].
2020-03-13 18:23:39,318 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] with logging.
2020-03-13 18:23:39,322 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG git version 2.9.3

2020-03-13 18:23:39,323 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] finished with return code [0].
2020-03-13 18:23:39,323 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default fetch --prune --tags origin] with logging.
2020-03-13 18:23:39,928 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default fetch --prune --tags origin] finished with return code [0].
2020-03-13 18:23:39,929 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] with logging.
2020-03-13 18:23:39,932 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG git version 2.9.3

2020-03-13 18:23:39,933 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] finished with return code [0].
2020-03-13 18:23:39,933 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default for-each-ref refs/remotes/ --format='%(refname:short)'] with output.
2020-03-13 18:23:39,936 ActorAddr-(T|:37573)/PID:4212 esrally.utils.repo INFO Checking out [7] in [/home/elastic/.rally/benchmarks/tracks/default] for distribution version [7.5.2].
2020-03-13 18:23:39,936 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] with logging.
2020-03-13 18:23:39,938 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG git version 2.9.3

2020-03-13 18:23:39,938 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] finished with return code [0].
2020-03-13 18:23:39,938 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default checkout 7] with logging.
2020-03-13 18:23:39,945 ActorAddr-(T|:37573)/PID:4212 esrally.utils.process INFO Already on '7'
Your branch is up-to-date with 'origin/7'.

One other thing to consider:

When Rally appears to be hanging as you say, the target nodes are probably downloading Elasticsearch. (However, some output should be emitted on stdout.)

Can you paste ~/.rally/logs/rally.log from both target nodes (not the coordinator)?

Last time I waited overnight for many hours and nothing moved, hence I said it is not progressing... Here are the target hosts' rally.log files.

vm-2

    [elastic@elastic-vm-2 .rally]$ tail -f /home/elastic/.rally/logs/rally.log
    2020-03-13 17:17:26,798 -not-actor-/PID:1637 root INFO ---- Actor System shutdown
    2020-03-13 17:54:08,762 -not-actor-/PID:3689 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.121', 'Convention Address.IPv4': '10.156.129.118:1900', 'coordinator': False}].
    2020-03-13 17:54:08,784 -not-actor-/PID:3693 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
    2020-03-13 17:54:08,784 -not-actor-/PID:3693 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py
    2020-03-13 18:21:06,610 -not-actor-/PID:3726 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
    2020-03-13 18:21:06,629 -not-actor-/PID:3694 root INFO ActorSystem Logging Shutdown
    2020-03-13 18:21:06,631 -not-actor-/PID:3693 root INFO ---- Actor System shutdown
    2020-03-13 18:22:37,732 -not-actor-/PID:3729 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.121', 'coordinator': False, 'Convention Address.IPv4': '10.156.129.118:1900'}].
    2020-03-13 18:22:37,753 -not-actor-/PID:3733 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
    2020-03-13 18:22:37,754 -not-actor-/PID:3733 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py

vm-4

[elastic@elastic-vm-4 .rally]$ tail -f /home/elastic/.rally/logs/rally.log
2020-03-13 17:58:01,576 ActorAddr-(T|:1900)/PID:3690 esrally.actor INFO Capabilities [{'ip': '10.156.129.169', 'Thespian Generation': (3, 9), 'Thespian Version': '1584122070188', 'coordinator': False, 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Python Version': (3, 5, 1, 'final', 0), 'Thespian ActorSystem Version': 2, 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Watch Supported': True}] match requirements [{'ip': '10.156.129.169'}].
2020-03-13 18:19:53,420 ActorAddr-(T|:34807)/PID:3693 esrally.actor DEBUG NodeMechanicActor#receiveMessage(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|10.156.129.118:33016)])
2020-03-13 18:21:17,255 -not-actor-/PID:3725 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-13 18:21:17,276 -not-actor-/PID:3691 root INFO ActorSystem Logging Shutdown
2020-03-13 18:21:17,298 -not-actor-/PID:3690 root INFO ---- Actor System shutdown
2020-03-13 18:22:45,358 -not-actor-/PID:3728 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.169', 'Convention Address.IPv4': '10.156.129.118:1900', 'coordinator': False}].
2020-03-13 18:22:45,379 -not-actor-/PID:3732 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
2020-03-13 18:22:45,380 -not-actor-/PID:3732 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py
2020-03-13 18:23:41,109 ActorAddr-(T|:1900)/PID:3732 esrally.actor INFO Capabilities [{'Python Version': (3, 5, 1, 'final', 0), 'ip': '10.156.129.169', 'Thespian Watch Supported': True, 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'coordinator': False, 'Thespian Version': '1584123765373', 'Thespian Generation': (3, 9)}] match requirements [{'ip': '10.156.129.169'}].
2020-03-13 18:23:41,110 ActorAddr-(T|:1900)/PID:3732 esrally.actor INFO Capabilities [{'Python Version': (3, 5, 1, 'final', 0), 'ip': '10.156.129.169', 'Thespian Watch Supported': True, 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'coordinator': False, 'Thespian Version': '1584123765373', 'Thespian Generation': (3, 9)}] match requirements [{'ip': '10.156.129.169'}].

Did you ensure that all Python/Rally processes had been killed (killall -9 esrallyd, killall -9 esrally) on all VMs before you restarted esrallyd and then invoked Rally with --target-hosts=IP1:39200,IP2:39200?

Also can you paste the complete esrally command that you are using on the coordinator?
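
For completeness, the clean-up I have in mind on every VM is along these lines (a rough sketch):

    killall -9 esrally esrallyd 2>/dev/null
    ps -ef | grep -i "[r]ally"    # should print nothing once everything is gone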

I did "esrallyd stop" on all nodes before invoking the new Rally test. I also verified with "ps -ef | grep esrallyd" that no esrallyd process was running.

On the coordinator I just changed the port number; otherwise it was the same command.

[elastic@elastic-vm-1 tmp]$ esrally --distribution-version=7.5.2 --target-hosts=10.40.195.121:39200,10.156.129.169:39200

I will try killall -9 esrallyd and killall -9 esrally on all VMs, then restart the test and update here.

@dliappis I reran the test after killall -9 esrallyd and killall -9 esrally on all VMs. No luck; unfortunately the same issue exists.

Hi,

Could you perform the following steps:

  1. Rerun and paste the output of esrally --version on all VMs.

  2. Set logging to DEBUG in ~/.rally/logging.json (see the first sentence of my earlier comment on how to do it), restart esrallyd, rerun esrally as you've done so far, and paste ~/.rally/logs/rally.log from all nodes (coordinator and targets).

  3. Stop all esrally/esrallyd processes on all VMs (just ps -ef | grep -i rally and kill everything associated).
    Export the following environment variables:

    export THESPLOG_FILE="${HOME}/.rally/logs/actor-system-internal.log"
    export THESPLOG_FILE_MAXSIZE=204800
    export THESPLOG_THRESHOLD="INFO"
    

    Start esrallyd again on all nodes and rerun your esrally command, roughly as in the sketch after this list.
    Finally, share the contents of ~/.rally/logs/actor-system-internal.log from all VMs.
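
Putting step 3 together, on a target node (vm-4 as an example) the sequence would look roughly like this; the coordinator is the same except for the --node-ip value (a sketch of the steps above, not new instructions):

    export THESPLOG_FILE="${HOME}/.rally/logs/actor-system-internal.log"
    export THESPLOG_FILE_MAXSIZE=204800
    export THESPLOG_THRESHOLD="INFO"
    esrallyd start --node-ip=10.156.129.169 --coordinator-ip=10.156.129.118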

@dliappis Followed your comments and reran again.

I feel this error on vm-1 is one of the blocking issues. What does "Warn Convention registration from ActorAddr-(T|:1900) is an invalid address" mean? I noticed this earlier in /tmp/thespian.log as well and included it in my first post.

vm1 : actor-system-internal.log

[root@elastic-vm-1 logs]# tail -f actor-system-internal.log
2020-03-16 11:03:54.558068 p8901 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.
2020-03-16 11:14:01.063494 p8901 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.
2020-03-16 11:24:07.824471 p8901 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.

vm2 : actor-system-internal.log

[elastic@elastic-vm-2 .rally]$ cat /home/elastic/.rally/logs/actor-system-internal.log
2020-03-16 11:04:10.859003 p8039 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 9)
2020-03-16 11:04:10.867003 p8039 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (first time)
2020-03-16 11:04:10.867608 p8039 I    Setting log aggregator of ActorAddr-(T|:33220) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:04:10.869669 p8039 I    Got Convention registration from ActorAddr-(T|10.156.129.118:1900) (re-registering) (new? True)
2020-03-16 11:14:55.272684 p8039 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (re-registering)
2020-03-16 11:14:55.273815 p8039 I    Setting log aggregator of ActorAddr-(T|:33220) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:14:55.275830 p8039 I    Got Convention registration from ActorAddr-(T|10.156.129.118:1900) (re-registering) (new? False)
2020-03-16 11:22:17.833819 p8039 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (re-registering)
2020-03-16 11:22:17.834338 p8039 I    Setting log aggregator of ActorAddr-(T|:33220) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:22:17.835912 p8039 I    Got Convention registration from ActorAddr-(T|10.156.129.118:1900) (re-registering) (new? False)
2020-03-16 11:29:40.394369 p8039 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (re-registering)
2020-03-16 11:29:40.395022 p8039 I    Setting log aggregator of ActorAddr-(T|:33220) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:29:40.397052 p8039 I    Got Convention registration from ActorAddr-(T|10.156.129.118:1900) (re-registering) (new? False)

vm4 : actor-system-internal.log - Unlike target vm2, this target (vm4) does not have the "Got Convention registration from ActorAddr" message; this looks related to the warning in the coordinator (vm1) log above.

[elastic@elastic-vm-4 .rally]$ cat /home/elastic/.rally/logs/actor-system-internal.log
2020-03-16 11:03:54.840395 p8023 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 9)
2020-03-16 11:03:54.847412 p8023 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (first time)
2020-03-16 11:03:54.847947 p8023 I    Setting log aggregator of ActorAddr-(T|:43301) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:14:01.353790 p8023 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (first time)
2020-03-16 11:14:01.354349 p8023 I    Setting log aggregator of ActorAddr-(T|:43301) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:24:08.114846 p8023 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (first time)
2020-03-16 11:24:08.115364 p8023 I    Setting log aggregator of ActorAddr-(T|:43301) to ActorAddr-(T|10.156.129.118:1900)
2020-03-16 11:34:14.876119 p8023 I    Admin registering with Convention @ ActorAddr-(T|10.156.129.118:1900) (first time)
2020-03-16 11:34:14.876864 p8023 I    Setting log aggregator of ActorAddr-(T|:43301) to ActorAddr-(T|10.156.129.118:1900)

The version is the same across the VMs:

[elastic@elastic-vm-1 ~]$ esrally --version
esrally 1.4.0
[elastic@elastic-vm-2 .rally]$ esrally --version
esrally 1.4.0
[elastic@elastic-vm-4 .rally]$ esrally --version
esrally 1.4.0

rally.log across the VMs looks similar after running esrally. Due to the restriction on post size, I will share them in another post. Is there another way to upload log files in one shot instead of over multiple posts?

Additional logs - continued from the previous post

vm1 - coordinator rally.log (initial logs after running)

2020-03-16 18:03:33,601 -not-actor-/PID:8897 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.118', 'Convention Address.IPv4': '10.156.129.118:1900', 'coordinator': True}].
2020-03-16 18:03:33,622 -not-actor-/PID:8901 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
2020-03-16 18:03:33,623 -not-actor-/PID:8901 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py
2020-03-16 18:26:11,500 -not-actor-/PID:8920 esrally.rally INFO OS [uname_result(system='Linux', node='elastic-vm-1', release='3.10.0-514.el7.x86_64', version='#1 SMP Tue Nov 22 16:42:41 UTC 2016', machine='x86_64', processor='x86_64')]
2020-03-16 18:26:11,500 -not-actor-/PID:8920 esrally.rally INFO Python [namespace(cache_tag='cpython-35', hexversion=50659824, name='cpython', version=sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0))]
2020-03-16 18:26:11,500 -not-actor-/PID:8920 esrally.rally INFO Rally version [1.4.0]
2020-03-16 18:26:11,500 -not-actor-/PID:8920 esrally.rally DEBUG Command line arguments: Namespace(advanced_config=False, assume_defaults=False, car='defaults', car_params='', challenge=None, client_options='timeout:60', configuration_name=None, distribution_repository='release', distribution_version='7.5.2', effective_start_date=None, elasticsearch_plugins='', enable_driver_profiling=False, exclude_tasks=None, include_tasks=None, keep_cluster_running=False, load_driver_hosts='localhost', offline=False, on_error='continue', pipeline='', plugin_params='', preserve_install='False', quiet=False, race_id='a8df2c65-f0e2-42cf-93c6-786489de918b', report_file='', report_format='markdown', revision='current', runtime_jdk=None, show_in_report='available', skip_rest_api_check=False, subcommand=None, target_hosts='10.40.195.121:39200,10.156.129.169:39200', team_path=None, team_repository='default', team_revision=None, telemetry='', telemetry_params='', test_mode=False, track=None, track_params='', track_path=None, track_repository='default', track_revision=None, user_tag='')
2020-03-16 18:26:11,501 -not-actor-/PID:8920 esrally.utils.net INFO Connecting directly to the Internet (no proxy support).
2020-03-16 18:26:11,501 -not-actor-/PID:8920 esrally.utils.net DEBUG Checking for internet connection against [https://github.com/]
2020-03-16 18:26:11,852 -not-actor-/PID:8920 esrally.utils.net DEBUG Probing result is HTTP status [200]
2020-03-16 18:26:11,852 -not-actor-/PID:8920 esrally.rally INFO Detected a working Internet connection.
2020-03-16 18:26:11,878 -not-actor-/PID:8920 esrally.rally INFO Actor system already running locally? [True]
2020-03-16 18:26:11,878 -not-actor-/PID:8920 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-16 18:26:11,895 -not-actor-/PID:8920 esrally.racecontrol INFO User specified no pipeline. Automatically derived pipeline [from-distribution].
2020-03-16 18:26:11,896 -not-actor-/PID:8920 esrally.racecontrol INFO Using configured hosts [{'host': '10.40.195.121', 'port': 39200}, {'host': '10.156.129.169', 'port': 39200}]
2020-03-16 18:26:11,896 -not-actor-/PID:8920 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-16 18:26:11,979 ActorAddr-(T|:1900)/PID:8901 esrally.driver.scheduler DEBUG Registering object [<class 'esrally.driver.scheduler.DeterministicScheduler'>] for [deterministic].
2020-03-16 18:26:11,979 ActorAddr-(T|:1900)/PID:8901 esrally.driver.scheduler DEBUG Registering object [<class 'esrally.driver.scheduler.PoissonScheduler'>] for [poisson].
2020-03-16 18:26:11,983 ActorAddr-(T|:1900)/PID:8901 esrally.actor INFO Capabilities [{'ip': '10.156.129.118', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1584381813616', 'Python Version': (3, 5, 1, 'final', 0), 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'Thespian ActorSystem Name': 'multiprocTCPBase', 'coordinator': True}] match requirements [{'coordinator': True}].
2020-03-16 18:26:11,983 ActorAddr-(T|:1900)/PID:8901 esrally.actor INFO Capabilities [{'ip': '10.156.129.118', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1584381813616', 'Python Version': (3, 5, 1, 'final', 0), 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'Thespian ActorSystem Name': 'multiprocTCPBase', 'coordinator': True}] match requirements [{'coordinator': True}].
2020-03-16 18:26:11,995 ActorAddr-(T|:35202)/PID:8927 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/tracks/default --version] with logging.
2020-03-16 18:26:11,999 ActorAddr-(T|:35202)/PID:8927 esrally.utils.process DEBUG git version 2.9.3

vm1 - coordinator rally.log (last few lines when it is not moving forward)
There are many lines in between; due to the post size limitation I am sharing only the relevant ones.

2020-03-16 18:26:13,358 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG git version 2.9.3

2020-03-16 18:26:13,358 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/teams/default --version] finished with return code [0].
2020-03-16 18:26:13,359 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/teams/default checkout 7] with logging.
2020-03-16 18:26:13,365 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process INFO Already on '7'
Your branch is up-to-date with 'origin/7'.

2020-03-16 18:26:13,365 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/teams/default checkout 7] finished with return code [0].
2020-03-16 18:26:13,365 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/teams/default rebase origin/7] with logging.
2020-03-16 18:26:13,407 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process INFO Current branch 7 is up to date.

2020-03-16 18:26:13,408 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/teams/default rebase origin/7] finished with return code [0].
2020-03-16 18:26:13,408 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/teams/default --version] with logging.
2020-03-16 18:26:13,410 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG git version 2.9.3

2020-03-16 18:26:13,410 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Subprocess [git -C /home/elastic/.rally/benchmarks/teams/default --version] finished with return code [0].
2020-03-16 18:26:13,410 ActorAddr-(T|:37967)/PID:8991 esrally.utils.process DEBUG Running subprocess [git -C /home/elastic/.rally/benchmarks/teams/default rev-parse --short HEAD] with output.
2020-03-16 18:26:13,415 ActorAddr-(T|:37967)/PID:8991 esrally.actor INFO Cluster consisting of [{'port': 39200, 'host': '10.40.195.121'}, {'port': 39200, 'host': '10.156.129.169'}] will be provisioned by Rally.
2020-03-16 18:26:13,416 ActorAddr-(T|:37967)/PID:8991 esrally.actor INFO Capabilities [{'ip': '10.156.129.118', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Thespian Version': '1584381813616', 'Python Version': (3, 5, 1, 'final', 0), 'Convention Address.IPv4': '10.156.129.118:1900', 'Thespian Generation': (3, 9), 'Thespian ActorSystem Name': 'multiprocTCPBase', 'coordinator': True}] match requirements [{}].
2020-03-16 18:48:08,69 -not-actor-/PID:8920 esrally.racecontrol INFO User has cancelled the benchmark (detected by race control).
2020-03-16 18:48:08,72 -not-actor-/PID:8920 esrally.racecontrol INFO Telling benchmark actor to exit.
2020-03-16 18:48:08,73 ActorAddr-(T|:35202)/PID:8927 esrally.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2020-03-16 18:48:08,75 ActorAddr-(T|:37967)/PID:8991 esrally.actor INFO MechanicActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:35202)])
2020-03-16 18:48:08,77 ActorAddr-(T|:36449)/PID:9055 esrally.actor INFO mechanic.Dispatcher#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:37967)])
2020-03-16 18:48:08,79 ActorAddr-(T|:37967)/PID:8991 esrally.actor ERROR Child actor exited with [ChildActorExited:ActorAddr-(T|:36449)] while in status [starting].
2020-03-16 18:48:08,85 ActorAddr-(T|:35202)/PID:8927 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:37967)] (ignoring).

vm2 target rally.log

2020-03-16 17:46:34,175 -not-actor-/PID:3937 root INFO ActorSystem Logging Shutdown
2020-03-16 17:46:34,177 -not-actor-/PID:3936 root INFO ---- Actor System shutdown
2020-03-16 18:04:10,838 -not-actor-/PID:8035 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.121', 'Convention Address.IPv4': '10.156.129.118:1900', 'coordinator': False}].
2020-03-16 18:04:10,859 -not-actor-/PID:8039 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
2020-03-16 18:04:10,859 -not-actor-/PID:8039 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py

vm4 target rally.log

2020-03-16 17:46:45,205 -not-actor-/PID:7980 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2020-03-16 17:46:45,226 -not-actor-/PID:3934 root INFO ActorSystem Logging Shutdown
2020-03-16 17:46:45,247 -not-actor-/PID:3933 root INFO ---- Actor System shutdown
2020-03-16 18:03:54,816 -not-actor-/PID:8019 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'ip': '10.156.129.169', 'Convention Address.IPv4': '10.156.129.118:1900', 'coordinator': False}].
2020-03-16 18:03:54,840 -not-actor-/PID:8023 root INFO ++++ Actor System gen (3, 9) started, admin @ ActorAddr-(T|:1900)
2020-03-16 18:03:54,840 -not-actor-/PID:8023 root DEBUG Thespian source: /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/thespian/__init__.py

Can you try upgrading to Rally 1.4.1 and run the test again?
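
Assuming Rally was installed with pip, the upgrade would be something like:

    pip3 install --upgrade esrally==1.4.1
    esrally --version    # should now report 1.4.1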

Hi @ven100, you can post the whole log files to https://gist.github.com/ and then just link them here.


Thanks @dliappis for all the suggestions so far. I upgraded to 1.4.1 and tried again: same behavior and errors.

[root@elastic-vm-1 logs]# tail -2 /tmp/thespian.log
2020-03-16 14:37:56.149123 p9302 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.
2020-03-16 14:48:02.632891 p9302 Warn Convention registration from ActorAddr-(T|:1900) is an invalid address; ignoring.
[root@elastic-vm-1 logs]#