Esrally stalls when running in benchmark-only mode


(Anshum Gupta) #1

I have been trying to run esrally using the following command:
esrally --pipeline=benchmark-only --target-hosts=host:port1,host:port2

I am running a vanilla es setup with 2 nodes, version 2.4.0 that I'm trying to benchmark.
I am on Mac OSX Sierra, Java version - 1.8.0_112
Python Version: 3.5.0

This however stalls at:
Running scroll [some percentage]

The percentage varies on each run.

Here's the last few lines of the logs:

2017-01-25 20:50:37,848 rally.driver INFO Scheduling next task for client id [1] at their timestamp [752349.383946] (master timestamp [752349.400726])
2017-01-25 20:50:37,848 rally.driver INFO Scheduling next task for client id [2] at their timestamp [752349.384846] (master timestamp [752349.400726])
2017-01-25 20:50:37,848 rally.driver INFO Client [0] is executing [Task for [scroll]].
2017-01-25 20:50:37,849 rally.driver INFO Creating iteration-count based schedule for [scroll] with [500] warmup iterations and [1000] iterations.
2017-01-25 20:50:37,849 rally.driver INFO Scheduling next task for client id [3] at their timestamp [752349.386791] (master timestamp [752349.400726])
2017-01-25 20:50:37,849 rally.driver INFO client [1] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO Scheduling next task for client id [4] at their timestamp [752349.386906] (master timestamp [752349.400726])
2017-01-25 20:50:37,850 rally.driver INFO client [2] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO client [3] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,850 rally.driver INFO Scheduling next task for client id [5] at their timestamp [752349.389782] (master timestamp [752349.400726])
2017-01-25 20:50:37,851 rally.driver INFO Scheduling next task for client id [6] at their timestamp [752349.390083] (master timestamp [752349.400726])
2017-01-25 20:50:37,851 rally.driver INFO client [4] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,852 rally.driver INFO Scheduling next task for client id [7] at their timestamp [752349.392962] (master timestamp [752349.400726])
2017-01-25 20:50:37,852 rally.driver INFO client [5] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,853 rally.driver INFO client [6] reached join point [JoinPoint(10)].
2017-01-25 20:50:37,854 rally.driver INFO client [7] reached join point [JoinPoint(10)].
2017-01-26 06:43:39,125 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 234, in run
    pipeline(cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 48, in __call__
    self.target(cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 182, in benchmark_only
    return race(Benchmark(cfg, mechanic.create(cfg, metrics_store, external=True), metrics_store), cfg)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 156, in race
    benchmark.run(lap)
  File "/usr/local/lib/python3.5/site-packages/esrally/racecontrol.py", line 70, in run
    driver.StartBenchmark(self.cfg, self.track, self.metrics_store.meta_info, self.metrics_store.lap))
  File "/usr/local/lib/python3.5/site-packages/thespian/actors.py", line 714, in ask
    return self._systemBase.ask(actorAddr, msg, timeout)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/systemBase.py", line 271, in ask
    response = self.transport.run(None, toTimeDeltaOrNone(timeout))
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/wakeupTransportBase.py", line 68, in run
    rval = self._runWithExpiry(incomingHandler)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/TCPTransport.py", line 1050, in _runWithExpiry

I am not sure if this is a known issue, or I'm doing something wrong.


(Daniel Mitterdorfer) #2

Hi @anshumg,

it's not a known issue and your invocation also seems fine. I'm a seeing that correctly that you let it run over the night and it then failed at 06:43 in the morning?

The failure is most likely that Rally has timed out waiting for the benchmark to finish. However, I am not yet sure about the root cause (i.e. why the scrolls take so long).

  • How much memory are you allocating for your nodes?
  • Do they spend a lot of time garbage-collecting? (see the Elasticsearch logs)

Daniel


(Anshum Gupta) #3

That's about the right amount of time, though it wasn't overnight as the timezone on my machine is not in-sync with my location.

I just ran with the defaults.

Here's all of the elasticsearch.log (assuming that is what you were referring to):

Node 1:

[2017-01-25 10:37:01,975][INFO ][node ] [Hurricane] version[2.4.0], pid[79728], build[ce9f0c7/2016-08-29T09:14:17Z]
[2017-01-25 10:37:01,977][INFO ][node ] [Hurricane] initializing ...
[2017-01-25 10:37:02,481][INFO ][plugins ] [Hurricane] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-01-25 10:37:02,507][INFO ][env ] [Hurricane] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [102.4gb], net total_space [464.7gb], spins? [unknown], types [hfs]
[2017-01-25 10:37:02,507][INFO ][env ] [Hurricane] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-01-25 10:37:02,507][WARN ][env ] [Hurricane] max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-01-25 10:37:03,949][INFO ][node ] [Hurricane] initialized
[2017-01-25 10:37:03,949][INFO ][node ] [Hurricane] starting ...
[2017-01-25 10:37:04,026][INFO ][transport ] [Hurricane] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}
[2017-01-25 10:37:04,031][INFO ][discovery ] [Hurricane] elasticsearch/MzFm25SCRbyzGLEFh7f7sw
[2017-01-25 10:37:07,064][INFO ][cluster.service ] [Hurricane] new_master {Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-01-25 10:37:07,079][INFO ][http ] [Hurricane] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}
[2017-01-25 10:37:07,079][INFO ][node ] [Hurricane] started
[2017-01-25 10:37:07,117][INFO ][gateway ] [Hurricane] recovered [1] indices into cluster_state
[2017-01-25 10:37:07,378][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[customer][2], [customer][3], [customer][3]] ...]).
[2017-01-25 10:38:15,013][INFO ][cluster.metadata ] [Hurricane] [twitter] creating index, cause [auto(index api)], templates [], shards [5]/[1], mappings []
[2017-01-25 10:38:15,061][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[twitter][4]] ...]).
[2017-01-25 10:38:15,075][INFO ][cluster.metadata ] [Hurricane] [twitter] create_mapping [tweet]
[2017-01-25 10:40:14,384][INFO ][cluster.metadata ] [Hurricane] [geonames] creating index, cause [api], templates [], shards [5]/[0], mappings []
[2017-01-25 10:40:14,431][INFO ][cluster.metadata ] [Hurricane] [geonames] create_mapping [type]
[2017-01-25 10:40:14,434][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[geonames][4], [geonames][4]] ...]).
[2017-01-25 10:56:40,632][INFO ][cluster.metadata ] [Hurricane] updating number_of_replicas to [1] for indices [twitter]
[2017-01-25 11:04:24,090][INFO ][cluster.service ] [Hurricane] added {{node-2}{ud8UTpFLQ5my_NmVlDfTeg}{127.0.0.1}{127.0.0.1:9301},}, reason: zen-disco-join(join from node[{node-2}{ud8UTpFLQ5my_NmVlDfTeg}{127.0.0.1}{127.0.0.1:9301}])
[2017-01-25 11:04:27,059][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[customer][4]] ...]).
[2017-01-25 11:07:03,610][INFO ][cluster.metadata ] [Hurricane] [geonames] creating index, cause [api], templates [], shards [5]/[0], mappings []
[2017-01-25 11:07:03,653][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[geonames][0], [geonames][0]] ...]).
[2017-01-25 11:07:03,662][INFO ][cluster.metadata ] [Hurricane] [geonames] create_mapping [type]
[2017-01-25 12:31:46,783][INFO ][cluster.metadata ] [Hurricane] [geonames] creating index, cause [api], templates [], shards [5]/[0], mappings []
[2017-01-25 12:31:46,836][INFO ][cluster.routing.allocation] [Hurricane] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[geonames][0], [geonames][0]] ...]).
[2017-01-25 12:31:46,843][INFO ][cluster.metadata ] [Hurricane] [geonames] create_mapping [type]


(Anshum Gupta) #4

Here are the logs from the 2nd node (pasting them together makes me exceed the posting limit):

Node 2:

[2017-01-25 11:04:19,552][INFO ][node ] [node-2] version[2.4.0], pid[80722], build[ce9f0c7/2016-08-29T09:14:17Z]
[2017-01-25 11:04:19,552][INFO ][node ] [node-2] initializing ...
[2017-01-25 11:04:19,932][INFO ][plugins ] [node-2] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-01-25 11:04:19,947][INFO ][env ] [node-2] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [101.7gb], net total_space [464.7gb], spins? [unknown], types [hfs]
[2017-01-25 11:04:19,947][INFO ][env ] [node-2] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-01-25 11:04:19,948][WARN ][env ] [node-2] max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-01-25 11:04:20,963][INFO ][node ] [node-2] initialized
[2017-01-25 11:04:20,963][INFO ][node ] [node-2] starting ...
[2017-01-25 11:04:21,022][INFO ][transport ] [node-2] publish_address {127.0.0.1:9301}, bound_addresses {[fe80::1]:9301}, {[::1]:9301}, {127.0.0.1:9301}
[2017-01-25 11:04:21,026][INFO ][discovery ] [node-2] elasticsearch/ud8UTpFLQ5my_NmVlDfTeg
[2017-01-25 11:04:24,096][INFO ][cluster.service ] [node-2] detected_master {Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}, added {{Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [{Hurricane}{MzFm25SCRbyzGLEFh7f7sw}{127.0.0.1}{127.0.0.1:9300}])
[2017-01-25 11:04:24,152][INFO ][http ] [node-2] publish_address {127.0.0.1:9201}, bound_addresses {[fe80::1]:9201}, {[::1]:9201}, {127.0.0.1:9201}
[2017-01-25 11:04:24,152][INFO ][node ] [node-2] started
[2017-01-25 11:04:24,286][WARN ][transport ] [node-2] Transport response handler not found of id [17]
[2017-01-25 11:04:24,286][WARN ][transport ] [node-2] Transport response handler not found of id [16]
[2017-01-25 11:04:24,896][WARN ][transport ] [node-2] Transport response handler not found of id [22]
[2017-01-25 11:04:24,913][WARN ][transport ] [node-2] Transport response handler not found of id [23]
[2017-01-25 11:04:25,432][WARN ][transport ] [node-2] Transport response handler not found of id [28]
[2017-01-25 11:04:25,479][WARN ][transport ] [node-2] Transport response handler not found of id [30]
[2017-01-25 11:04:25,968][WARN ][transport ] [node-2] Transport response handler not found of id [33]
[2017-01-25 11:04:26,003][WARN ][transport ] [node-2] Transport response handler not found of id [36]
[2017-01-25 11:04:26,530][WARN ][transport ] [node-2] Transport response handler not found of id [45]
[2017-01-25 11:04:27,078][WARN ][transport ] [node-2] Transport response handler not found of id [48]
[2017-01-25 11:04:27,082][WARN ][transport ] [node-2] Transport response handler not found of id [49]


(Daniel Mitterdorfer) #5

Hi @anshumg,

times in the the Rally logs are always gathered in UTC, so you're right... .

W.r.t. to your actual issue: I suspected that due to the default configuration (just 1GB main memory) your nodes would be stuck in garbage collection but this is not the case.

I also tried to reproduce the issue locally on my Mac Book with the following hardware configuration: 16 GB RAM, 256 GB SSD and i7-4870HQ CPU @ 2.50GHz. The benchmark ran fine and none of the nodes logged any GC activity (meaning that GC wasn't causing so much overhead to appear in the Elasticsearch logs).

Overall, the benchmark took 1700 seconds. Here are my results for the scroll query:

|   Lap |                         Metric |            Operation |     Value |   Unit |
|------:|-------------------------------:|---------------------:|----------:|-------:|
|   All |                 Min Throughput |               scroll |   63.8736 |  ops/s |
|   All |              Median Throughput |               scroll |   64.7807 |  ops/s |
|   All |                 Max Throughput |               scroll |   65.3526 |  ops/s |
|   All |      50.0th percentile latency |               scroll |    366102 |     ms |
|   All |      90.0th percentile latency |               scroll |    508101 |     ms |
|   All |      99.0th percentile latency |               scroll |    540306 |     ms |
|   All |      99.9th percentile latency |               scroll |    543480 |     ms |
|   All |       100th percentile latency |               scroll |    543839 |     ms |
|   All | 50.0th percentile service time |               scroll |   376.909 |     ms |
|   All | 90.0th percentile service time |               scroll |   388.145 |     ms |
|   All | 99.0th percentile service time |               scroll |    416.25 |     ms |
|   All | 99.9th percentile service time |               scroll |   621.818 |     ms |
|   All |  100th percentile service time |               scroll |   694.482 |     ms |

So while we clearly see that this hardware cannot cope with the desired target throughput (latency goes through the roof), the benchmark finished fine.

Any chance that you are saturated on RAM, have a spinning disk or anything like that?

You can try to dig deeper by having Rally provision the two nodes for you (it seems you just run the standard configuration):

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr

Note: JFR will enable Java flight recorder, which will only work if you have Oracle JDK installed. If you have OpenJDK installed, then don't specify it. This will enable garbage collection logs on both nodes and Java flight recorder which should hopefully tell you why the node is stalled.

Daniel


(Anshum Gupta) #6

I'll try re-reunning w/ Rally provisioning the two nodes tomorrow morning.

I don't think hardware is an issue as my macbook is spec'ed exactly as yours.


(Anshum Gupta) #7

I tried running using those params and here's the error this time around (from gc log):

2017-01-26T10:13:52.751+0800: 5644.424: [GC (Allocation Failure) 2017-01-26T10:13:52.753+0800: 5644.426: [ParNew
Desired survivor size 4456448 bytes, new threshold 6 (max 6)

  • age 1: 161776 bytes, 161776 total
  • age 2: 206736 bytes, 368512 total
  • age 3: 3031880 bytes, 3400392 total
  • age 4: 112 bytes, 3400504 total
  • age 5: 336 bytes, 3400840 total
  • age 6: 112 bytes, 3400952 total

It's stalled at :
Running index-append [ 92% done]

Also, the console logs the following line about flight recording:

[INFO] Flight Recorder: Writing flight recording to [/Users/anshumgupta/.rally/benchmarks/races/2017-01-26-16-39-21/local/tracks/geonames/append-no-conflicts/telemetry/two_nodes-rally-node1.jfr]

but, there's no such file at the location.

Also, to clarify, here's my Java version (Oracle JDK):
sherlock:elastic anshumgupta$ java -version java version "1.8.0_112" Java(TM) SE Runtime Environment (build 1.8.0_112-b16) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b16, mixed mode)


(Daniel Mitterdorfer) #8

Hi @anshumg,

This is just normal GC business. :slight_smile: It basically means that a young generation GC has been run and that's nothing to worry about.

The flight recording is only written upon termination of the JVM. So you could just kill the two Elasticsearch nodes with kill -9 PID_OF_THE_PROCESSES (the easiest way to find the PIDs is to run jps -v).

However, I don't really see a good reason why the benchmark should stall on such a beefy machine. So for the next steps I suggest:

Can you please run a very reduced version of the benchmark (this is just to see whether it terminates at all on your machine)?

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr --test-mode

The ---test-mode parameter indexes just a tiny bit of the data and runs just one iteration of each query. It should finish pretty fast (in less than two minutes on your machine).

If that works fine, can you please run the full benchmark again?

esrally --distribution-version=2.4.0 --car=two_nodes --telemetry=gc,jfr

Can you please keep an eye on "Activity Monitor" whether main memory is exhausted?

If the benchmark is not terminating "within reasonable time" (should finish within an hour), can you please terminate it, manually kill the Elasticsearch instances in case Rally did not terminate them and provide all files in the race directory (i.e. ~/.rally/benchmarks/races/YOUR_RACE_TIMESTAMP)?

Daniel


(Anshum Gupta) #9

Thanks Daniel. I was able to run that successfully, and was able to run the benchmark with these params successfully too:

--car=two_nodes --telemetry=gc,jfrl

Not sure what made the earlier runs fail so I'll retry running this and see if I hit the issue again. Also, I'm yet to run this with the original params to benchmark my manually setup instance.

I'll post the update once that's done.


(Anshum Gupta) #10

After the one successful run, I haven't been able to get it to complete. I've sent you the logs so perhaps something will come out of it.

There wasn't much that I could figure from the logs but I'm trying to run the minimal test again, with rally provisioning the instances for the benchmarking instead of pointing it to my instances.


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.