Index-and-query-logs-fixed-daily-volume challenge exception

Hello, I've got an exception during the execution of index-and-query-logs-fixed-daily-volume.

I've executed the index-logs-fixed-daily-volume challenge and it generated the data in ES:
esrally --track-repository=eventdata --track=eventdata --challenge=index-logs-fixed-daily-volume --target-hosts=10.145.14.242:9200 --track-params="bulk_size:10000,bulk_indexing_clients:64,number_of_shards:12,number_of_days:10,daily_logging_volume:100GB" --pipeline=benchmark-only

After that I executed the command:

esrally --track-repository=eventdata --track=eventdata --challenge=index-and-query-logs-fixed-daily-volume --target-hosts=10.145.14.242:9200 --track-params="bulk_size:10000,bulk_indexing_clients:32,search_clients:32,number_of_days:10,daily_logging_volume:100GB" --pipeline=benchmark-only

and after the throughput test step it throws an exception. What could be the problem?

> 2020-11-11 17:06:32,542 -not-actor-/PID:6589 esrally.rally ERROR Cannot run subcommand [race].
> Traceback (most recent call last):
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 770, in dispatch_sub_command
>     race(cfg)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 680, in race
>     with_actor_system(racecontrol.run, cfg)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/rally.py", line 708, in with_actor_system
>     runnable(cfg)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 375, in run
>     raise e
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 372, in run
>     pipeline(cfg)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 60, in __call__
>     self.target(cfg)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 305, in benchmark_only
>     return race(cfg, external=True)
>   File "/home/user/.local/lib/python3.8/site-packages/esrally/racecontrol.py", line 251, in race
>     raise exceptions.RallyError(result.message, result.cause)
> esrally.exceptions.RallyError: ('Error in load generator [43]', '("Cannot execute [user-defined runner for [kibana_async]]. Provided parameters are: [\'body\', \'meta_data\']. Error: [\'took\'].", None)')
> 2020-11-11 17:06:42,447 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [3] has exited.
> 2020-11-11 17:06:42,456 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [3] has exited.
> 2020-11-11 17:06:43,356 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [12] has exited.
> 2020-11-11 17:06:43,368 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [12] has exited.
> 2020-11-11 17:06:44,901 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [4] has exited.
> 2020-11-11 17:06:44,913 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [4] has exited.
> 2020-11-11 17:06:53,716 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [15] has exited.
> 2020-11-11 17:06:53,725 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [15] has exited.
> 2020-11-11 17:07:03,765 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [6] has exited.
> 2020-11-11 17:07:03,780 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [6] has exited.
> 2020-11-11 17:07:29,24 ActorAddr-(T|:37445)/PID:6649 esrally.actor INFO Worker [14] has exited.
> 2020-11-11 17:07:29,25 ActorAddr-(T|:37107)/PID:6600 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:37445)] (ignoring).
> 2020-11-11 17:07:29,27 -not-actor-/PID:6599 root INFO ActorSystem Logging Shutdown
> 2020-11-11 17:07:29,47 -not-actor-/PID:6598 root INFO ---- Actor System shutdown

Thanks in advance
Oleg.

Did you check the Elasticsearch logs? It's possible that with parallel indexing + querying at such high values you've exceeded the search (or some other) queue capacity.
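A quick way to check for that is the thread pool cat API, e.g. (host taken from your command above; the column selection is just an example):

curl -s 'http://10.145.14.242:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected'

Non-zero rejected counts for the search or write pool during the run would confirm queue saturation.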

Unrelated to this question, bear in mind this challenge is typically run after first executing https://github.com/elastic/rally-eventdata-track#index-logs-fixed-daily-volume. The idea is that you first backfill your data (simulating a cluster that has already ingested X days' worth of data), and then the challenge you've used indexes less data (with varying loads) while querying in parallel, to simulate a real index+search use case.


Thank you for the answer. I did execute the previous challenge and populated the cluster with the data:

index-logs-fixed-daily-volume

The exception in the Rally terminal is the one quoted above.

I don't see any exceptions in the Elasticsearch logs:

> [2020-11-11T18:08:06,627][INFO ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][30701] overhead, spent [417ms] collecting in the last [1s]
> [2020-11-11T18:08:06,950][INFO ][o.e.c.r.a.AllocationService] [oleg-elastic1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[elasticlogs-2999-01-01-throughput-test][61], [elasticlogs-2999-01-01-throughput-test][60], [elasticlogs-2999-01-01-throughput-test][62]]]).
> [2020-11-11T18:08:07,628][INFO ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][30702] overhead, spent [259ms] collecting in the last [1s]
> [2020-11-11T18:11:47,176][WARN ][o.e.t.TransportService   ] [oleg-elastic1] Received response for a request that has timed out, sent [19316ms] ago, timed out [4291ms] ago, action [cluster:monitor/nodes/stats[n]], node [{oleg-elastic1}{AEycWRe1TKaM3k3ZCDS9GQ}{PD0xaXOJQIWs3ipy7ZR42g}{10.145.14.241}{10.145.14.241:9300}{dilmrt}{ml.machine_memory=50630656000, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}], id [52906433]
> [2020-11-11T18:13:32,136][WARN ][o.e.t.TransportService   ] [oleg-elastic1] Received response for a request that has timed out, sent [28616ms] ago, timed out [13736ms] ago, action [cluster:monitor/nodes/stats[n]], node [{oleg-elastic1}{AEycWRe1TKaM3k3ZCDS9GQ}{PD0xaXOJQIWs3ipy7ZR42g}{10.145.14.241}{10.145.14.241:9300}{dilmrt}{ml.machine_memory=50630656000, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}], id [53097199]
> [2020-11-11T18:15:42,292][WARN ][o.e.c.InternalClusterInfoService] [oleg-elastic1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
> [2020-11-11T18:15:52,753][WARN ][o.e.t.TransportService   ] [oleg-elastic1] Received response for a request that has timed out, sent [40432ms] ago, timed out [25381ms] ago, action [cluster:monitor/nodes/stats[n]], node [{oleg-elastic1}{AEycWRe1TKaM3k3ZCDS9GQ}{PD0xaXOJQIWs3ipy7ZR42g}{10.145.14.241}{10.145.14.241:9300}{dilmrt}{ml.machine_memory=50630656000, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}], id [53349175]
> [2020-11-11T18:16:45,556][INFO ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][31218] overhead, spent [277ms] collecting in the last [1s]
> [2020-11-11T18:18:07,807][INFO ][o.e.c.m.MetadataDeleteIndexService] [oleg-elastic1] [elasticlogs-2999-01-01-throughput-test/oSgKGhhbS46PK-b4c_RJLA] deleting index
> [2020-11-11T18:18:16,280][INFO ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][young][31307][64998] duration [1.9s], collections [2]/[2.5s], total [1.9s]/[41m], memory [9.8gb]->[7.9gb]/[31.6gb], all_pools {[young] [2.7gb]->[4.5mb]/[2.7gb]}{[survivor] [319.9mb]->[357.7mb]/[357.7mb]}{[old] [6.7gb]->[7.6gb]/[28.5gb]}
> [2020-11-11T18:18:16,281][WARN ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][31307] overhead, spent [1.9s] collecting in the last [2.5s]

I've been working for quite some time to push Elasticsearch to its maximum I/O capacity. I've tried many options and I can reach peak capacity of 1200 MB/s, but I can't sustain it. I thought combining reads and writes could bring the desired result.
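(Side note: to watch raw disk throughput outside the monitoring UI, plain iostat from sysstat works, e.g. the command below; the 5-second interval is just an example.)

iostat -xm 5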

This is what I've reached so far:

[monitoring screenshot omitted] The upper-left chart shows 384 MB/s.

What could be done to fix the exception?

Seeing such messages, as well as:

> [2020-11-11T18:18:16,281][WARN ][o.e.m.j.JvmGcMonitorService] [oleg-elastic1] [gc][31307] overhead, spent [1.9s] collecting in the last [2.5s]

Spending 76% of the time in GC means that your cluster is definitely badly saturated. Check CPU usage (as well as disk I/O usage, but you've checked that already), and please do that for the load driver too.
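If it helps, a lightweight way to watch CPU and heap pressure on the cluster side during the race is the nodes cat API (host taken from your commands; the column selection is just an example):

curl -s 'http://10.145.14.242:9200/_cat/nodes?v&h=name,cpu,load_1m,heap.percent'

and plain top / iostat on the load driver machine.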

Also, regarding the timeout errors in ES, keep in mind that the default client timeout in Rally is 60s; this can be adjusted, see the Command Line Reference in the Rally 2.10.0 documentation.
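For example (the 120s value below is purely illustrative; check the linked reference for the exact syntax in your Rally version), appending --client-options to your original command raises the timeout:

esrally --track-repository=eventdata --track=eventdata --challenge=index-and-query-logs-fixed-daily-volume --target-hosts=10.145.14.242:9200 --track-params="bulk_size:10000,bulk_indexing_clients:32,search_clients:32,number_of_days:10,daily_logging_volume:100GB" --pipeline=benchmark-only --client-options="timeout:120"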


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.