[rally] Could not connect to your Elasticsearch metrics store

In long:
Hey, I am trying to run a benchmark against a cluster of 3 nodes running in VirtualBox, with no SSL or authentication (it is purely for testing). I also have a second cluster hosted in Azure, with SSL and HTTPS enabled, where I want to keep the metrics store.

In short:

  • local cluster in VirtualBox as the benchmark target, with Rally running locally; SSL and HTTPS are disabled.
  • cluster in the cloud to store the Rally metrics, with SSL and HTTPS enabled.

Configuration:

  • local cluster to run the test, using ELK 7.9.3
  • cluster on the cloud: ELK 7.12.0
  • esrally 2.2.1
  • python 3.8.10 using pyenv

rally.ini:

...
[reporting]
datastore.type = elasticsearch
datastore.host = myhost.myregion.cloudapp.azure.com
datastore.port = 9200
datastore.secure = true
# elastic is the default user with admin privileges
datastore.user = elastic
datastore.password = ****
datastore.ssl.verification_mode = none
datastore.ssl.certificate_authorities = /home/user/.rally/elasticsearch-ca.pem
...
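As a side note, here is a minimal sketch of how the [reporting] keys above map to the URL that appears in the log, using Python's standard configparser. It is not Rally's own loading code: the helper names read_reporting and metrics_store_url are hypothetical, and whether Rally strips inline comments is an assumption you should check against your version. With configparser's defaults, text after a `#` on the same line as a value stays part of the value.

```python
import configparser

# Copy of some [reporting] keys from the rally.ini above (password omitted).
SAMPLE = """
[reporting]
datastore.type = elasticsearch
datastore.host = myhost.myregion.cloudapp.azure.com
datastore.port = 9200
datastore.secure = true
datastore.user = elastic # is the default user with admin privileges
"""


def read_reporting(text, strip_inline_comments=False):
    """Parse the [reporting] section (hypothetical helper, for illustration only)."""
    kwargs = {"inline_comment_prefixes": ("#",)} if strip_inline_comments else {}
    parser = configparser.ConfigParser(**kwargs)
    parser.read_string(text)
    return dict(parser["reporting"])


def metrics_store_url(reporting):
    """Build the base URL implied by host, port and the secure flag."""
    secure = reporting.get("datastore.secure", "false").lower() == "true"
    scheme = "https" if secure else "http"
    return f"{scheme}://{reporting['datastore.host']}:{reporting['datastore.port']}/"


if __name__ == "__main__":
    default = read_reporting(SAMPLE)
    # With default settings the trailing comment is kept in the value.
    print(repr(default["datastore.user"]))
    print(metrics_store_url(default))
```

Keeping comments on their own line avoids any ambiguity about what ends up in the user name.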

Output of the command: esrally race --track=percolator --target-hosts=node1:9200,node2:9200,node3:9200 --pipeline=benchmark-only --kill-running-processes:


    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[ERROR] Cannot race. Error in race control (Could not connect to your Elasticsearch metrics store. Please check that it is running on host [myhost.myregion.cloudapp.azure.com] at port [9200] or fix the configuration in [/home/user/.rally/rally.ini].)

---------------------------------
[INFO] FAILURE (took 487 seconds)
---------------------------------

rally.log:

...
2021-09-09 09:43:44,904 ActorAddr-(T|:41764)/PID:3073 esrally.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2021-09-09 09:43:47,907 -not-actor-/PID:3066 esrally.rally INFO Attempting to shutdown internal actor system.
2021-09-09 09:43:47,910 -not-actor-/PID:3072 root INFO ActorSystem Logging Shutdown
2021-09-09 09:43:47,919 -not-actor-/PID:3066 esrally.rally INFO Actor system is still running. Waiting...
2021-09-09 09:43:47,919 -not-actor-/PID:3071 root INFO ---- Actor System shutdown
2021-09-09 09:43:48,920 -not-actor-/PID:3066 esrally.rally INFO Shutdown completed.
2021-09-09 09:43:48,921 -not-actor-/PID:3066 esrally.rally ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 854, in dispatch_sub_command
    race(cfg, args.kill_running_processes)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 637, in race
    with_actor_system(racecontrol.run, cfg)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 664, in with_actor_system
    runnable(cfg)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 357, in run
    raise e
  File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 354, in run
    pipeline(cfg)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 60, in __call__
    self.target(cfg)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 292, in benchmark_only
    return race(cfg, external=True)
  File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 251, in race
    raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: Error in race control (Could not connect to your Elasticsearch metrics store. Please check that it is running on host [myhost.myregion.cloudapp.azure.com] at port [9200] or fix the configuration in [/home/user/.rally/rally.ini].)

I don't know what I am doing wrong. Do I need to configure something on the Azure cluster so that Rally can create the metrics store?

Start of the log (I can't post the whole log due to the character limit):

2021-09-09 09:35:41,377 -not-actor-/PID:3066 esrally.rally INFO OS [uname_result(system='Linux', node='node1', release='4.4.0-186-generic', version='#216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020', machine='x86_64', processor='x86_64')]
2021-09-09 09:35:41,378 -not-actor-/PID:3066 esrally.rally INFO Python [namespace(_multiarch='x86_64-linux-gnu', cache_tag='cpython-38', hexversion=50858736, name='cpython', version=sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0))]
2021-09-09 09:35:41,378 -not-actor-/PID:3066 esrally.rally INFO Rally version [2.2.1]
2021-09-09 09:35:41,378 -not-actor-/PID:3066 esrally.utils.net INFO Connecting directly to the Internet (no proxy support).
2021-09-09 09:35:43,750 -not-actor-/PID:3066 esrally.rally INFO Detected a working Internet connection.
2021-09-09 09:35:43,751 -not-actor-/PID:3066 esrally.rally INFO Killing running Rally processes
2021-09-09 09:35:43,775 -not-actor-/PID:3066 esrally.utils.process INFO Killing lingering process with PID [2996] and command line [['/home/user/.pyenv/versions/3.8.10/bin/python3', '/home/user/.local/bin/esrally', 'race', '--track=percolator', '--target-hosts=node1:9200,node2:9200,node3:9200', '--pipeline=benchmark-only', '--kill-running-processes']].
2021-09-09 09:35:43,778 -not-actor-/PID:3066 esrally.utils.process INFO Killing lingering process with PID [2997] and command line [['/home/user/.pyenv/versions/3.8.10/bin/python3', '/home/user/.local/bin/esrally', 'race', '--track=percolator', '--target-hosts=node1:9200,node2:9200,node3:9200', '--pipeline=benchmark-only', '--kill-running-processes']].
2021-09-09 09:35:43,779 -not-actor-/PID:3066 esrally.utils.process INFO Killing lingering process with PID [2998] and command line [['/home/user/.pyenv/versions/3.8.10/bin/python3', '/home/user/.local/bin/esrally', 'race', '--track=percolator', '--target-hosts=node1:9200,node2:9200,node3:9200', '--pipeline=benchmark-only', '--kill-running-processes']].
2021-09-09 09:35:43,782 -not-actor-/PID:3066 esrally.rally INFO Actor system already running locally? [False]
2021-09-09 09:35:43,782 -not-actor-/PID:3066 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900'}].
2021-09-09 09:35:43,804 -not-actor-/PID:3071 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2021-09-09 09:35:43,818 -not-actor-/PID:3066 esrally.racecontrol INFO Race id [bf8e11ef-d8f1-4f22-a358-28c001474652]
2021-09-09 09:35:43,819 -not-actor-/PID:3066 esrally.racecontrol INFO User specified pipeline [benchmark-only].
2021-09-09 09:35:43,819 -not-actor-/PID:3066 esrally.racecontrol INFO Using configured hosts [{'host': 'node1', 'port': 9200}, {'host': 'node2', 'port': 9200}, {'host': 'node3', 'port': 9200}]
2021-09-09 09:35:43,821 -not-actor-/PID:3066 esrally.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2021-09-09 09:35:43,823 ActorAddr-(T|:1900)/PID:3071 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1631180143796'}] match requirements [{'coordinator': True}].
2021-09-09 09:37:44,474 -not-actor-/PID:3073 elasticsearch WARNING GET https://myhost.myregion.cloudapp.azure.com:9200/ [status:N/A request:120.086s]
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 245, in perform_request
    response = self.pool.urlopen(
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/util/retry.py", line 507, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/home/user/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fd115633bb0>, 'Connection to myhost.myregion.cloudapp.azure.com timed out. (connect timeout=120)')
...

Hi! Thanks for the detailed information.

HTTPS/TLS is a red herring in this case: I can see from the traceback that we can't establish the TCP connection at all.

Can you please try connecting to myhost.myregion.cloudapp.azure.com:9200 directly, for example using netcat, and report your findings?

nc -v myhost.myregion.cloudapp.azure.com 9200
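If netcat is not installed on the machine where Rally runs, roughly the same check can be sketched with Python's standard library. The probe function below is a hypothetical helper, not part of Rally; it reports which stage fails first (name resolution, TCP connect, or TLS handshake). Certificate verification is deliberately disabled here, on the assumption that only reachability is being tested.

```python
import socket
import ssl


def probe(host, port, timeout=5.0, use_tls=False):
    """Return (stage, detail), where stage is 'dns', 'tcp', 'tls' or 'ok'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror as exc:
        # The host name could not be resolved to an address.
        return ("dns", str(exc))
    except OSError as exc:
        # Covers connect timeouts and refused connections.
        return ("tcp", str(exc))
    with sock:
        if not use_tls:
            return ("ok", "TCP connection established")
        # Reachability test only: skip certificate and host name checks.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        try:
            with context.wrap_socket(sock, server_hostname=host):
                return ("ok", "TLS handshake completed")
        except (ssl.SSLError, OSError) as exc:
            return ("tls", str(exc))


if __name__ == "__main__":
    print(probe("myhost.myregion.cloudapp.azure.com", 9200, use_tls=True))
```

A result of 'dns' or 'tcp' means the problem sits below TLS, so the certificate settings in rally.ini are not the cause.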

PS: Testing in VirtualBox is unlikely to get you any meaningful results due to the virtualization overhead.

Yes indeed, the result of nc -v myhost.myregion.cloudapp.azure.com 9200 is nc: connect to myhost.myregion.cloudapp.azure.com port 9200 (tcp) failed: Connection timed out. One thing to mention: the command curl -k -v -u elastic:mypassword --cacert elasticsearch-ca.pem https://myhost.myregion.cloudapp.azure.com:9200 worked before, but I can no longer reproduce a successful response. On the other hand, I already tried to store Rally metrics in an Elastic Cloud instance and got the same error, which is a timeout.
But I can still access https://myhost.myregion.cloudapp.azure.com:9200 from a web browser, and the same was true when I tried Elastic Cloud.

Finally, the TLS certificate used in the Azure cluster is self-signed.

One more thing is not clear to me in the Rally docs:

... At the end of a race, Rally stores all metrics records in its metrics store, which is a dedicated Elasticsearch cluster.

Does this mean that I need a dedicated Elasticsearch cluster with some specific configuration and no other indices in it? The Azure cluster already has some indices.

But I can still access https://myhost.myregion.cloudapp.azure.com:9200 from a web browser, and the same was true when I tried Elastic Cloud.

Reading between the lines, are you saying that you can access Azure from outside the VM (using your browser), but not from inside the VM (where Rally runs)?

Does this mean that I need a dedicated Elasticsearch cluster with some specific configuration and no other indices in it? The Azure cluster already has some indices.

It's OK to have other indices. What we mean by dedicated is that it should be a different cluster from the one Rally is benchmarking. We'll make sure to clarify the docs here; the wording is indeed confusing.

Yes, I can't access the Azure VM from inside VirtualBox, where Rally runs. I just tested with curl www.google.com and got curl: (6) Could not resolve host: www.google.com. I think it's a DNS issue, but I'm not sure what's wrong!
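A quick way to confirm a name-resolution problem from inside the VM, independent of curl, is to ask the system resolver directly. This is a minimal sketch using only the standard library; resolve is a hypothetical helper name, and an empty result only says that this machine's resolver could not answer, not why.

```python
import socket


def resolve(host):
    """Return the sorted IP addresses the system resolver gives for host, or []."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed (for example, no reachable DNS server).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for name in ("www.google.com", "myhost.myregion.cloudapp.azure.com"):
        print(name, resolve(name))
```

If both names come back empty inside the VM but resolve on the host machine, the VM's DNS or network adapter configuration is the place to look.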

Yes, I can't access the Azure VM from inside VirtualBox, where Rally runs.

Unfortunately we can't help you with configuring your VM. Once networking works, please reply to this topic, or create a new one, if you run into any Rally issue!

However, I'd like to emphasize again that you're unlikely to get any interesting results with VirtualBox. Maybe Docker is an option? We support that explicitly: see "Running Rally with Docker" in the Rally 2.2.1 documentation.

I agree @Quentin_Pradet. In the end I found that I have a DNS issue in my VirtualBox VM. As a workaround, I created a second VM on the same VirtualBox network to store the metrics (only to test the feature).

As for your suggestion to use Docker, I don't have the skills for it yet ^^, but I will soon ;).

Thank you for your assistance!
