Esrally stalls when running in benchmark-only mode

anshumg · January 26, 2017, 6:54am

I have been trying to run esrally using the following command:
esrally --pipeline=benchmark-only --target-hosts=host:port1,host:port2

I am running a vanilla es setup with 2 nodes, version 2.4.0 that I'm trying to benchmark.
I am on Mac OSX Sierra, Java version - 1.8.0_112
Python Version: 3.5.0

This however stalls at:
Running scroll [some percentage]

The percentage varies on each run.

Here's the last few lines of the logs:

2017-01-25 20:50:37,848 rally.driver INFO Scheduling next task for client id [1] at their timestamp [752349.383946] (master timestamp [752349.400726])
2017-01-25 20:50:37,848 rally.driver INFO Scheduling next task for client id [2] at their timestamp [752349.384846] (master timestamp [752349.400726])
2017-01-25 20:50:37,848 rally.driver INFO Client [0] is executing [Task for [scroll]].
2017-01-25 20:50:37,849 rally.driver INFO Creating iteration-count based schedule for [scroll] with [500] warmup iterations and [1000] iterations.
2017-01-25 20:50:37,849 rally.driver INFO Scheduling next task for client id [3] at their timestamp [752349.386791] (master timestamp [752349.400726])
2017-01-25 20:50:37,849 rally.driver INFO client [1] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO Scheduling next task for client id [4] at their timestamp [752349.386906] (master timestamp [752349.400726])
2017-01-25 20:50:37,850 rally.driver INFO client [2] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO client [3] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO Scheduling next task for client id [5] at their timestamp [752349.389782] (master timestamp [752349.400726])
2017-01-25 20:50:37,851 rally.driver INFO Scheduling next task for client id [6] at their timestamp [752349.390083] (master timestamp [752349.400726])
2017-01-25 20:50:37,851 rally.driver INFO client [4] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,852 rally.driver INFO Scheduling next task for client id [7] at their timestamp [752349.392962] (master timestamp [752349.400726])
2017-01-25 20:50:37,852 rally.driver INFO client [5] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,853 rally.driver INFO client [6] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,854 rally.driver INFO client [7] reached join point [JoinPoint(10)].
2017-01-26 06:43:39,125 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 234, in run
    pipeline(cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 48, in __call__
    self.target(cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 182, in benchmark_only
    return race(Benchmark(cfg, mechanic.create(cfg, metrics_store, external=True), metrics_store), cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 156, in race
    benchmark.run(lap)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 70, in run
    driver.StartBenchmark(self.cfg, self.track, self.metrics_store.meta_info, self.metrics_store.lap))
  File "/usr/local/lib/python3.5/site-packages/thespian/actors.py", line 714, in ask
    return self._systemBase.ask(actorAddr, msg, timeout)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/systemBase.py", line 271, in ask
    response = self.transport.run(None, toTimeDeltaOrNone(timeout))
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/wakeupTransportBase.py", line 68, in run
    rval = self._runWithExpiry(incomingHandler)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/TCPTransport.py", line 1050, in _runWithExpiry

I am not sure if this is a known issue, or I'm doing something wrong.

danielmitterdorfer · January 26, 2017, 7:03am

Hi @anshumg,

it's not a known issue and your invocation also seems fine. I'm a seeing that correctly that you let it run over the night and it then failed at 06:43 in the morning?

The failure is most likely that Rally has timed out waiting for the benchmark to finish. However, I am not yet sure about the root cause (i.e. why the scrolls take so long).

How much memory are you allocating for your nodes?
Do they spend a lot of time garbage-collecting? (see the Elasticsearch logs)

Daniel

anshumg · January 26, 2017, 7:23am

That's about the right amount of time, though it wasn't overnight as the timezone on my machine is not in-sync with my location.

I just ran with the defaults.

Here's all of the elasticsearch.log (assuming that is what you were referring to):

Node 1:

[2017-01-25 10:37:01,975][INFO ][node [2017-01-25 10:37:01,977][INFO ][node [2017-01-25 10:37:02,481][INFO ][plugins [2017-01-25 10:37:02,507][INFO ][env [2017-01-25 10:37:02,507][INFO ][env [2017-01-25 10:37:02,507][WARN ][env [2017-01-25 10:37:03,949][INFO ][node [2017-01-25 10:37:03,949][INFO ][node [2017-01-25 10:37:04,026][INFO ][transport [2017-01-25 10:37:04,031][INFO ][discovery [2017-01-25 10:37:07,064][INFO ][cluster.service [2017-01-25 10:37:07,079][INFO ][http [2017-01-25 10:37:07,079][INFO ][node [2017-01-25 10:37:07,117][INFO ][gateway [2017-01-25 10:37:07,378][INFO ][cluster.r [2017-01-25 10:38:15,013][INFO ][cluster.metadata [2017-01-25 10:38:15,061][INFO ][cluster.r [2017-01-25 10:38:15,075][INFO ][cluster.metadata [2017-01-25 10:40:14,384][INFO ][cluster.metadata [2017-01-25 10:40:14,431][INFO ][cluster.metadata [2017-01-25 10:40:14,434][INFO ][cluster.r [2017-01-25 10:56:40,632][INFO ][cluster.metadata [2017-01-25 11:04:24,090][INFO ][cluster.service [2017-01-25 11:04:27,059][INFO ][cluster.r [2017-01-25 11:07:03,610][INFO ][cluster.metadata [2017-01-25 11:07:03,653][INFO ][cluster.r [2017-01-25 11:07:03,662][INFO ][cluster.metadata [2017-01-25 12:31:46,783][INFO ][cluster.metadata [2017-01-25 12:31:46,836][INFO ][cluster.r [2017-01-25 12:31:46,843][INFO ][cluster.metadata ] [Hurricane] version[2.4.0], pid[79728], build[ce9f0c7/2016-08-29T09:14:17Z]
] [Hurricane] initializing ...
] [Hurricane] modules [reindex, lang-expression, lang-groovy], plugins , sites
] [Hurricane] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [102.4gb], net total_space [464.7gb], spins? [unknown], types [hfs]
] [Hurricane] heap size [989.8mb], compressed ordinary object pointers [true]
] [Hurricane] max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
] [Hurricane] initialized
] [Hurricane] starting ...
] [Hurricane] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}
] [Hurricane] elasticsearch/MzFm25SCRbyzGLEFh7f7sw
] [Hurricane] new_master {Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
] [Hurricane] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}
] [Hurricane] started
] [Hurricane] recovered [1] indices into cluster_state
outing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[customer][2], [customer][3], [customer][3]] ...]).
] [Hurricane] [twitter] creating index, cause [auto(index api)], templates , shards [5]/[1], mappings
outing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[twitter][4]] ...]).
] [Hurricane] [twitter] create_mapping [tweet]
] [Hurricane] [geonames] creating index, cause [api], templates , shards [5]/[0], mappings
] [Hurricane] [geonames] create_mapping [type]
outing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[geonames][4], [geonames][4]] ...]).
] [Hurricane] updating number_of_replicas to [1] for indices [twitter]
] [Hurricane] added {{node-2}{ud8UTpFLQ5my_NmVlDfTeg}{127.0.0.1}{127.0.0.1:9301},}, reason: zen-disco-join(join from node[{node-2}{ud8UTpFLQ5my_NmVlDfTeg}{127.0.0.1}{127.0.0.1:9301}])
outing.allocation] [Hurricane] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[customer][4]] ...]).
] [Hurricane] [geonames] creating index, cause [api], templates , shards [5]/[0], mappings
outing.allocation] [Hurricane] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[geonames][0], [geonames][0]] ...]).
] [Hurricane] [geonames] create_mapping [type]
] [Hurricane] [geonames] creating index, cause [api], templates , shards [5]/[0], mappings
outing.allocation] [Hurricane] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[geonames][0], [geonames][0]] ...]).
] [Hurricane] [geonames] create_mapping [type]

anshumg · January 26, 2017, 7:23am

Here are the logs from the 2nd node (pasting them together makes me exceed the posting limit):

Node 2:

[2017-01-25 11:04:19,552][INFO ][node ] [node-2] version[2.4.0], pid[80722], build[ce9f0c7/2016-08-29T09:14:17Z]
[2017-01-25 11:04:19,552][INFO ][node ] [node-2] initializing ...
[2017-01-25 11:04:19,932][INFO ][plugins ] [node-2] modules [reindex, lang-expression, lang-groovy], plugins , sites
[2017-01-25 11:04:19,947][INFO ][env ] [node-2] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [101.7gb], net total_space [464.7gb], spins? [unknown], types [hfs]
[2017-01-25 11:04:19,947][INFO ][env ] [node-2] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-01-25 11:04:19,948][WARN ][env ] [node-2] max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-01-25 11:04:20,963][INFO ][node ] [node-2] initialized
[2017-01-25 11:04:20,963][INFO ][node ] [node-2] starting ...
[2017-01-25 11:04:21,022][INFO ][transport ] [node-2] publish_address {127.0.0.1:9301}, bound_addresses {[fe80::1]:9301}, {[::1]:9301}, {127.0.0.1:9301}
[2017-01-25 11:04:21,026][INFO ][discovery ] [node-2] elasticsearch/ud8UTpFLQ5my_NmVlDfTeg
[2017-01-25 11:04:24,096][INFO ][cluster.service ] [node-2] detected_master {Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}, added {{Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [{Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}])
[2017-01-25 11:04:24,152][INFO ][http ] [node-2] publish_address {127.0.0.1:9201}, bound_addresses {[fe80::1]:9201}, {[::1]:9201}, {127.0.0.1:9201}
[2017-01-25 11:04:24,152][INFO ][node ] [node-2] started
[2017-01-25 11:04:24,286][WARN ][transport ] [node-2] Transport response handler not found of id [17]
[2017-01-25 11:04:24,286][WARN ][transport ] [node-2] Transport response handler not found of id [16]
[2017-01-25 11:04:24,896][WARN ][transport ] [node-2] Transport response handler not found of id [22]
[2017-01-25 11:04:24,913][WARN ][transport ] [node-2] Transport response handler not found of id [23]
[2017-01-25 11:04:25,432][WARN ][transport ] [node-2] Transport response handler not found of id [28]
[2017-01-25 11:04:25,479][WARN ][transport ] [node-2] Transport response handler not found of id [30]
[2017-01-25 11:04:25,968][WARN ][transport ] [node-2] Transport response handler not found of id [33]
[2017-01-25 11:04:26,003][WARN ][transport ] [node-2] Transport response handler not found of id [36]
[2017-01-25 11:04:26,530][WARN ][transport ] [node-2] Transport response handler not found of id [45]
[2017-01-25 11:04:27,078][WARN ][transport ] [node-2] Transport response handler not found of id [48]
[2017-01-25 11:04:27,082][WARN ][transport ] [node-2] Transport response handler not found of id [49]

danielmitterdorfer · January 26, 2017, 8:04am

Hi @anshumg,

times in the the Rally logs are always gathered in UTC, so you're right... .

W.r.t. to your actual issue: I suspected that due to the default configuration (just 1GB main memory) your nodes would be stuck in garbage collection but this is not the case.

I also tried to reproduce the issue locally on my Mac Book with the following hardware configuration: 16 GB RAM, 256 GB SSD and i7-4870HQ CPU @ 2.50GHz. The benchmark ran fine and none of the nodes logged any GC activity (meaning that GC wasn't causing so much overhead to appear in the Elasticsearch logs).

Overall, the benchmark took 1700 seconds. Here are my results for the scroll query:

|   Lap |                         Metric |            Operation |     Value |   Unit |
|------:|-------------------------------:|---------------------:|----------:|-------:|
|   All |                 Min Throughput |               scroll |   63.8736 |  ops/s |
|   All |              Median Throughput |               scroll |   64.7807 |  ops/s |
|   All |                 Max Throughput |               scroll |   65.3526 |  ops/s |
|   All |      50.0th percentile latency |               scroll |    366102 |     ms |
|   All |      90.0th percentile latency |               scroll |    508101 |     ms |
|   All |      99.0th percentile latency |               scroll |    540306 |     ms |
|   All |      99.9th percentile latency |               scroll |    543480 |     ms |
|   All |       100th percentile latency |               scroll |    543839 |     ms |
|   All | 50.0th percentile service time |               scroll |   376.909 |     ms |
|   All | 90.0th percentile service time |               scroll |   388.145 |     ms |
|   All | 99.0th percentile service time |               scroll |    416.25 |     ms |
|   All | 99.9th percentile service time |               scroll |   621.818 |     ms |
|   All |  100th percentile service time |               scroll |   694.482 |     ms |

So while we clearly see that this hardware cannot cope with the desired target throughput (latency goes through the roof), the benchmark finished fine.

Any chance that you are saturated on RAM, have a spinning disk or anything like that?

You can try to dig deeper by having Rally provision the two nodes for you (it seems you just run the standard configuration):

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr

Note: JFR will enable Java flight recorder, which will only work if you have Oracle JDK installed. If you have OpenJDK installed, then don't specify it. This will enable garbage collection logs on both nodes and Java flight recorder which should hopefully tell you why the node is stalled.

Daniel

anshumg · January 26, 2017, 9:28am

I'll try re-reunning w/ Rally provisioning the two nodes tomorrow morning.

I don't think hardware is an issue as my macbook is spec'ed exactly as yours.

anshumg · January 26, 2017, 6:35pm

I tried running using those params and here's the error this time around (from gc log):

2017-01-26T10:13:52.751+0800: 5644.424: [GC (Allocation Failure) 2017-01-26T10:13:52.753+0800: 5644.426: [ParNew
Desired survivor size 4456448 bytes, new threshold 6 (max 6)

age 1: 161776 bytes, 161776 total

age 2: 206736 bytes, 368512 total

age 3: 3031880 bytes, 3400392 total

age 4: 112 bytes, 3400504 total

age 5: 336 bytes, 3400840 total

age 6: 112 bytes, 3400952 total

It's stalled at :
Running index-append [ 92% done]

Also, the console logs the following line about flight recording:

[INFO] Flight Recorder: Writing flight recording to [/Users/anshumgupta/.rally/benchmarks/races/2017-01-26-16-39-21/local/tracks/geonames/append-no-conflicts/telemetry/two_nodes-rally-node1.jfr]

but, there's no such file at the location.

Also, to clarify, here's my Java version (Oracle JDK):
sherlock:elastic anshumgupta$ java -version java version "1.8.0_112" Java(TM) SE Runtime Environment (build 1.8.0_112-b16) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b16, mixed mode)

danielmitterdorfer · January 27, 2017, 8:04am

Hi @anshumg,

This is just normal GC business. It basically means that a young generation GC has been run and that's nothing to worry about.

The flight recording is only written upon termination of the JVM. So you could just kill the two Elasticsearch nodes with kill -9 PID_OF_THE_PROCESSES (the easiest way to find the PIDs is to run jps -v).

However, I don't really see a good reason why the benchmark should stall on such a beefy machine. So for the next steps I suggest:

Can you please run a very reduced version of the benchmark (this is just to see whether it terminates at all on your machine)?

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr --test-mode

The ---test-mode parameter indexes just a tiny bit of the data and runs just one iteration of each query. It should finish pretty fast (in less than two minutes on your machine).

If that works fine, can you please run the full benchmark again?

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr

Can you please keep an eye on "Activity Monitor" whether main memory is exhausted?

If the benchmark is not terminating "within reasonable time" (should finish within an hour), can you please terminate it, manually kill the Elasticsearch instances in case Rally did not terminate them and provide all files in the race directory (i.e. ~/.rally/benchmarks/races/YOUR_RACE_TIMESTAMP)?

Daniel

anshumg · January 27, 2017, 4:13pm

Thanks Daniel. I was able to run that successfully, and was able to run the benchmark with these params successfully too:

--car=two_nodes --telemetry=gc,jfrl

Not sure what made the earlier runs fail so I'll retry running this and see if I hit the issue again. Also, I'm yet to run this with the original params to benchmark my manually setup instance.

I'll post the update once that's done.

anshumg · January 27, 2017, 8:12pm

After the one successful run, I haven't been able to get it to complete. I've sent you the logs so perhaps something will come out of it.

There wasn't much that I could figure from the logs but I'm trying to run the minimal test again, with rally provisioning the instances for the benchmarking instead of pointing it to my instances.

system · February 24, 2017, 8:13pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Running benchmark-only stalls Elasticsearch rally	5	584	May 31, 2020
How to run esrally in background? Elasticsearch rally	8	2116	July 5, 2017
Rally stalls when I try to benchmark Elasticsearch cluster Elasticsearch docker , rally	4	659	September 7, 2020
Benchmarking Remote Cluster Stalls in Preparing Elasticsearch rally	7	1388	November 18, 2019
Benchmarking High Volumes Elasticsearch rally	2	507	May 11, 2019

Esrally stalls when running in benchmark-only mode

Related topics