Rally stalls when I try to benchmark an Elasticsearch cluster

Here are the details:

[rally1@rally ~]$ python3 --version
Python 3.6.8
[rally1@rally ~]$ java -version
openjdk version "14.0.2" 2020-07-14
OpenJDK Runtime Environment (build 14.0.2+12-46)
OpenJDK 64-Bit Server VM (build 14.0.2+12-46, mixed mode, sharing)
[rally1@rally ~]$ esrally --version
esrally 1.4.1
[rally1@rally ~]$ curl -X GET 'http://10.20.0.49:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "es-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}
[rally1@rally ~]$ esrally --track=geopoint --challenge=append-no-conflicts --target-hosts=10.20.0.49:9200,10.20.0.68:9200 --pipeline=benchmark-only

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/


************************************************************************
************** WARNING: A dark dungeon lies ahead of you  **************
************************************************************************

Rally does not have control over the configuration of the benchmarked
Elasticsearch cluster.

Be aware that results may be misleading due to problems with the setup.
Rally is also not able to gather lots of metrics at all (like CPU usage
of the benchmarked cluster) or may even produce misleading metrics (like
the index size).

************************************************************************
****** Use this pipeline only if you are aware of the tradeoffs.  ******
*************************** Watch your step! ***************************
************************************************************************


[ERROR] Cannot race. Error in race control (('An unknown error occurred while running the operation [bulk] against your Elasticsearch metrics store on host [10.20.0.49] at port [9200].', None))


The following is the error from the Rally logs:

2020-07-30 17:43:07,31 ActorAddr-(T|:42513)/PID:13497 esrally.client INFO SSL support: off
2020-07-30 17:43:07,31 ActorAddr-(T|:42513)/PID:13497 esrally.client INFO HTTP basic authentication: off
2020-07-30 17:43:07,32 ActorAddr-(T|:42513)/PID:13497 esrally.client INFO HTTP compression: off
2020-07-30 17:43:07,39 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO Asking mechanic to start the engine.
2020-07-30 17:43:07,65 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1596130926535'}] match requirements [{'coordinator': True}].
2020-07-30 17:43:07,67 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO Telling driver to prepare for benchmarking.
2020-07-30 17:43:07,76 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO Creating ES client connected to [{'host': '10.20.0.49', 'port': '9200'}] with options [{'use_ssl': False, 'verify_certs': True, 'timeout': 120}]
2020-07-30 17:43:07,76 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO SSL support: off
2020-07-30 17:43:07,76 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO HTTP basic authentication: off
2020-07-30 17:43:07,77 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO HTTP compression: off
2020-07-30 17:43:07,83 ActorAddr-(T|:41153)/PID:13577 esrally.metrics INFO Creating Elasticsearch metrics store
2020-07-30 17:43:07,84 ActorAddr-(T|:41153)/PID:13577 esrally.metrics INFO Opening metrics store for race timestamp=[20200730T174206Z], track=[geopoint], challenge=[append-no-conflicts], car=[['external']]
2020-07-30 17:43:07,113 ActorAddr-(T|:41153)/PID:13577 esrally.metrics INFO [rally-metrics-2020-07] already exists.
2020-07-30 17:44:07,116 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO Creating ES client connected to [{'host': '10.20.0.49', 'port': 9200}, {'host': '10.20.0.68', 'port': 9200}] with options [{'timeout': 60, 'retry-on-timeout': True}]
2020-07-30 17:44:07,116 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO SSL support: off
2020-07-30 17:44:07,143 ActorAddr-(T|:35571)/PID:13587 esrally.actor INFO Preparing track [geopoint]
2020-07-30 17:44:07,116 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO HTTP basic authentication: off
2020-07-30 17:44:07,144 ActorAddr-(T|:35571)/PID:13587 esrally.track.loader INFO Resolved data root directory for document corpus [geopoint] in track [geopoint] to ['/elasticsearch/data/geopoint'].
2020-07-30 17:44:07,117 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO HTTP compression: off
2020-07-30 17:44:07,117 ActorAddr-(T|:41153)/PID:13577 esrally.driver.driver INFO Checking if REST API is available.
2020-07-30 17:44:07,125 ActorAddr-(T|:41153)/PID:13577 esrally.client INFO REST API is available for >= [2] nodes after [0] attempts.
2020-07-30 17:44:07,126 ActorAddr-(T|:41153)/PID:13577 esrally.driver.driver INFO REST API is available.
2020-07-30 17:44:07,132 ActorAddr-(T|:41153)/PID:13577 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 8, 'final', 0), 'Thespian Generation': (3, 9), 'Thespian Version': '1596130926535'}] match requirements [{'coordinator': True}].
2020-07-30 17:44:07,149 ActorAddr-(T|:41153)/PID:13577 esrally.actor INFO A track preparator has exited.
2020-07-30 17:44:07,152 ActorAddr-(T|:41153)/PID:13577 esrally.actor INFO A track preparator has exited.
2020-07-30 17:45:07,182 ActorAddr-(T|:42513)/PID:13497 esrally.metrics ERROR An unknown error occurred while running the operation [bulk] against your Elasticsearch metrics store on host [10.20.0.49] at port [9200].
Traceback (most recent call last):

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 115, in guarded
    return target(*args, **kwargs)

  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 300, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):

  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 230, in streaming_bulk
    **kwargs

  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 158, in _process_bulk_chunk
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('1 document(s) failed to index.', [{'index': {'_index': 'rally-races-2020-07', '_type': '_doc', '_id': '640ab0ca-7fc5-4f12-89f5-d3cc6d4f2bb2', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[rally-races-2020-07][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[rally-races-2020-07][0]] containing [index {[rally-races-2020-07][_doc][640ab0ca-7fc5-4f12-89f5-d3cc6d4f2bb2], source[_na_]}]]'}, 'data': {'rally-version': '1.4.1', 'environment': 'AMD_Benchmarking', 'trial-id': '640ab0ca-7fc5-4f12-89f5-d3cc6d4f2bb2', 'trial-timestamp': '20200730T174206Z', 'race-id': '640ab0ca-7fc5-4f12-89f5-d3cc6d4f2bb2', 'race-timestamp': '20200730T174206Z', 'pipeline': 'benchmark-only', 'user-tags': {}, 'track': 'geopoint', 'car': ['external'], 'cluster': {'revision': 'b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89', 'distribution-version': '7.8.1', 'distribution-flavor': 'default', 'team-revision': None}, 'track-revision': '9a95395', 'challenge': 'append-no-conflicts'}}}])


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/usr/local/lib/python3.6/site-packages/esrally/actor.py", line 85, in guard
    return f(self, msg, sender)

  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 136, in receiveMsg_PreparationComplete
    self.race_store.store_race(self.race)

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 1419, in store_race
    self.es_store.store_race(race)

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 1394, in store_race
    self._store(race.as_dict())

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 1486, in _store
    self.client.index(index=self.index_name(), doc_type=EsRaceStore.RACE_DOC_TYPE, item=doc, id=doc["race-id"])

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 100, in index
    self.bulk_index(index, doc_type, [doc])

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 90, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index)

  File "/usr/local/lib/python3.6/site-packages/esrally/metrics.py", line 167, in guarded
    raise exceptions.RallyError(msg)

esrally.exceptions.RallyError: ('An unknown error occurred while running the operation [bulk] against your Elasticsearch metrics store on host [10.20.0.49] at port [9200].', None)

2020-07-30 17:45:07,187 ActorAddr-(T|:41153)/PID:13577 esrally.actor ERROR Main driver received a fatal exception from a load generator. Shutting down.
2020-07-30 17:45:07,188 ActorAddr-(T|:41153)/PID:13577 esrally.metrics INFO Closing metrics store.
2020-07-30 17:46:07,193 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO Received a benchmark failure from [ActorAddr-(T|:41153)] and will forward it now.
2020-07-30 17:46:07,195 -not-actor-/PID:13489 esrally.racecontrol ERROR A benchmark failure has occurred
2020-07-30 17:46:07,195 -not-actor-/PID:13489 esrally.racecontrol INFO Telling benchmark actor to exit.
2020-07-30 17:46:07,196 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2020-07-30 17:46:07,197 ActorAddr-(T|:40473)/PID:13576 esrally.actor INFO MechanicActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:42513)])
2020-07-30 17:46:07,197 ActorAddr-(T|:41153)/PID:13577 esrally.actor INFO Main driver received ActorExitRequest and will terminate all load generators.
2020-07-30 17:46:07,199 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:40473)] (ignoring).
2020-07-30 17:46:07,200 ActorAddr-(T|:42513)/PID:13497 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:41153)] (ignoring).
2020-07-30 17:46:10,199 -not-actor-/PID:13489 esrally.rally INFO Attempting to shutdown internal actor system.
2020-07-30 17:46:10,201 -not-actor-/PID:13496 root INFO ActorSystem Logging Shutdown
2020-07-30 17:46:10,222 -not-actor-/PID:13495 root INFO ---- Actor System shutdown
2020-07-30 17:46:10,222 -not-actor-/PID:13489 esrally.rally INFO Actor system is still running. Waiting...
2020-07-30 17:46:11,224 -not-actor-/PID:13489 esrally.rally INFO Shutdown completed.
2020-07-30 17:46:11,224 -not-actor-/PID:13489 esrally.rally ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 714, in dispatch_sub_command
    race(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 624, in race
    with_actor_system(racecontrol.run, cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 652, in with_actor_system
    runnable(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 365, in run
    raise e
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 362, in run
    pipeline(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 79, in __call__
    self.target(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 298, in benchmark_only
    return race(cfg, external=True)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 250, in race
    raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: ("Error in race control (('An unknown error occurred while running the operation [bulk] against your Elasticsearch metrics store on host [10.20.0.49] at port [9200].', None))", None)
                                                            

I'm not sure why this is happening.

Which Elasticsearch version are you using? What is the specification of your cluster? How many master eligible nodes do you have? How are the nodes configured? Was the cluster status green before you started?
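
As a quick check (just a suggestion, using the same host and port as in your health output above), the cluster allocation explain API should tell you why those 11 shards are unassigned:

curl -X GET 'http://10.20.0.49:9200/_cluster/allocation/explain?pretty'

Called without a request body it picks one of the unassigned shards and explains why it cannot be allocated. The bulk error in your logs ("primary shard is not active" for rally-races-2020-07) is consistent with that: Rally is writing its race metadata to a metrics store on the same red cluster, so the write cannot succeed until those shards are assigned.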

I experienced the same issue.

Rally 2.0.0 - Elasticsearch 7.7.0

This happens when testing a single-node cluster.

The main problem is that Rally just hangs without any output to the main thread.

Hi,

Can you please execute the following before you start the benchmark with Rally?

export THESPLOG_FILE="${HOME}/.rally/logs/actor-system-internal.log"
export THESPLOG_FILE_MAXSIZE=204800
export THESPLOG_THRESHOLD="INFO"

This enables an additional log file for the actor system that Rally uses internally for communication. The log file will be written to ~/.rally/logs/actor-system-internal.log.
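
If Rally hangs again, you can follow that file live in a second terminal (assuming the default ~/.rally location) with something like:

tail -f ~/.rally/logs/actor-system-internal.log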

Can you please provide the contents of this file when the benchmark hangs again? Can you please also share the exact command-line parameters that you use when you start Rally (I am specifically interested in how many clients you use in your benchmark)?

Thanks!

Daniel

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.