Unable to increase REST API startup timeout

Hi,

I am a newcomer to Rally and having a go at running some benchmarks. My OS is Ubuntu Linux 18.04.1 64-bit. Rally version is 1.4.1. I have created a custom track and set of cars, and have a benchmark that runs nicely by default (i.e. if I have a vanilla defaults car that does nothing special to start the REST server).

However, when I change the configuration to use a different car mixin, REST requests to the server don't get through. My mixin adds some special Java agent properties to the server's JVM startup, which I can't describe as they are specific to my enterprise. Suffice it to say that the server takes too long to start up, and I don't seem to be able to increase the server startup timeout so that Rally waits until the server is ready and the agent has done its work.

My rally.ini file is here:

[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = /home/spayne/.rally/benchmarks
src.root.dir = /home/spayne/.rally/benchmarks/src

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch

[benchmarks]
local.dataset.cache = /home/spayne/.rally/benchmarks/data

[reporting]
datastore.type = in-memory
datastore.host = 
datastore.port = 
datastore.secure = False
datastore.user = 
datastore.password = 

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.cache = true

An example race's elasticsearch.yml file is here:

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please see the documentation for further information on configuration options:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html>
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: rally-benchmark
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: rally-node-0
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: ['/home/spayne/.rally/benchmarks/races/8fb02ac0-da4d-492b-8f63-ed4bf9422751/rally-node-0/install/elasticsearch-6.6.2/data']
#
# Path to log files:
#
path.logs: /home/spayne/.rally/benchmarks/races/8fb02ac0-da4d-492b-8f63-ed4bf9422751/rally-node-0/logs/server
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 39200-39300

transport.tcp.port: 39300-39400
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html>
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["127.0.0.1"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 1

discovery.zen.fd.ping_internal: 30s
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 100

#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery.html>
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-gateway.html>
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

And the rally.log file is here (a snippet from the end of the run; there are hundreds of refused-connection messages):

2020-03-02 15:06:07,548 -not-actor-/PID:30588 elasticsearch WARNING GET http://127.0.0.1:39200/_cluster/health?wait_for_nodes=%3E%3D1 [status:N/A request:0.000s]
Traceback (most recent call last):
  File "/home/spayne/.local/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/spayne/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/home/spayne/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

[...LOTS OF LINES REDACTED TO SAVE SPACE..]
    raise e
  File "/home/spayne/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 362, in run
    pipeline(cfg)
  File "/home/spayne/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 79, in __call__
    self.target(cfg)
  File "/home/spayne/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 290, in from_distribution
    return race(cfg, distribution=True)
  File "/home/spayne/.local/lib/python3.6/site-packages/esrally/racecontrol.py", line 250, in race
    raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: ("Error in driver (('Elasticsearch REST API layer is not available.', None))", None)

A sample command-line request to run a race is:

esrally --car=defaults,my-agent-mixin --pipeline=from-distribution --distribution-version=6.6.2 --track-path=/home/spayne/elasticsearch-src/custom-tracks --team-repository=/home/spayne/elasticsearch-src/teams --report-file=/home/spayne/elasticsearch-src/reports/my-agent-debug.md --report-format=markdown --client-options="timeout:1200" --preserve-install=true

The my-agent-mixin adds a few extra JVM system properties to jvm.options; these slow down the startup by running a particular javaagent.

Console output is:

   ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Preparing for race ...
[INFO] Preserving benchmark candidate installation at [/home/spayne/.rally/benchmarks/races/8fb02ac0-da4d-492b-8f63-ed4bf9422751/rally-node-0/install/elasticsearch-6.6.2].
[ERROR] Cannot race. Error in driver (('Elasticsearch REST API layer is not available.', None))

Getting further help:
*********************
* Check the log files in /home/spayne/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/1.4.1/
* Ask a question on the forum at https://discuss.elastic.co/c/elasticsearch/rally
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/spayne/.rally/logs.

---------------------------------
[INFO] FAILURE (took 152 seconds)
---------------------------------

and whatever I do, I don't seem to be able to make it wait longer than about 150s. (For example, I tried tweaking the zen discovery fault detection timeout and ping interval, which made no difference; I also added a long timeout to --client-options, which likewise made no difference.)

I guess the connection errors in rally.log occur because the server hasn't finished starting up yet. The core issue is that I don't seem to be able to make Rally keep waiting for the time needed to finish the slow startup (typically about 3 to 5 minutes).

Any suggestions?

Thanks,

Simon

Hey Simon,

That is hardcoded in Rally's driver.py: 40 tries with a 3-second sleep between tries.
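
To illustrate the kind of retry loop I mean, here is a minimal sketch. It is not Rally's actual code: the function name `wait_until_ready`, its parameters, and the `probe` callback are all hypothetical, and only the defaults (40 tries, 3-second sleep) come from the values above. With those defaults the loop sleeps for roughly two minutes in total before giving up, which would be consistent with the race failing after about 150 seconds.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    max_attempts: int = 40,        # hardcoded value mentioned above
    sleep_seconds: float = 3.0,    # hardcoded value mentioned above
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: call probe() until it succeeds or attempts run out.

    probe() should return True once the server answers; a refused connection
    (ConnectionError) is treated as "not ready yet".
    """
    for attempt in range(max_attempts):
        try:
            if probe():
                return True
        except ConnectionError:
            # Server is not accepting connections yet; retry after a pause.
            pass
        # No need to sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(sleep_seconds)
    return False
```

If the two values were exposed as settings instead of being fixed, a server that needs 3 to 5 minutes to start could be covered by something like `wait_until_ready(probe, max_attempts=120)`, which allows for close to six minutes of waiting.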

I am going to open an issue to see about making this an external flag, and I will comment here when I've done so.

I have created the issue here


Hi Michael,

That reference to the driver.py file was enough for me to get unblocked. Thanks very much.

Simon

