Rally gets stuck while benchmarking


(Akhil Rane) #1

I am benchmarking an Elasticsearch cluster, but Rally gets stuck with the following output on the command line.

[Rally ASCII-art banner]

[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.


************** WARNING: A dark dungeon lies ahead of you **************


Rally does not have control over the configuration of the benchmarked
Elasticsearch cluster.

Be aware that results may be misleading due to problems with the setup.
Rally is also not able to gather lots of metrics at all (like CPU usage
of the benchmarked cluster) or may even produce misleading metrics (like
the index size).


****** Use this pipeline only if you are aware of the tradeoffs. ******
*************************** Watch your step! ***************************


[INFO] Racing on track [ilp_track], challenge [index il-search performance data] and car ['external'] with version [5.2.1].

[WARNING] indexing_total_time is 37333 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 11535 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 1146 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.


Following is the log output:

2018-09-17 19:01:20,933 ActorAddr-(T|:33393)/PID:17984 esrally.actor INFO Telling driver to start benchmark.
2018-09-17 19:01:20,853 ActorAddr-(T|:36524)/PID:17991 esrally.mechanic.launcher INFO REST API is available. Attaching telemetry devices to cluster.
2018-09-17 19:01:20,908 ActorAddr-(T|:36524)/PID:17991 esrally.mechanic.launcher INFO Telemetry devices are now attached to the cluster.
2018-09-17 19:01:20,909 ActorAddr-(T|:36524)/PID:17991 esrally.actor INFO Transitioning from [nodes_started] to [apply_meta_info].
2018-09-17 19:01:20,930 ActorAddr-(T|:36524)/PID:17991 esrally.actor INFO Transitioning from [cluster_started] to [benchmark_starting].
2018-09-17 19:01:20,965 ActorAddr-(T|:36524)/PID:17991 esrally.mechanic.telemetry WARNING indexing_total_time is 37333 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-17 19:01:20,966 ActorAddr-(T|:36524)/PID:17991 esrally.mechanic.telemetry WARNING refresh_total_time is 11535 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-17 19:01:20,967 ActorAddr-(T|:36524)/PID:17991 esrally.mechanic.telemetry WARNING flush_total_time is 1146 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-17 19:01:20,992 ActorAddr-(T|:33393)/PID:17984 esrally.actor INFO BenchmarkActor received unknown message [<esrally.mechanic.mechanic.BenchmarkStarted object at 0x7f3dae344dd8>] (ignoring).
2018-09-17 19:01:21,43 ActorAddr-(T|:40809)/PID:17993 esrally.metrics INFO Creating in-memory metrics store
2018-09-17 19:01:21,45 ActorAddr-(T|:40809)/PID:17993 esrally.metrics INFO Opening metrics store for trial timestamp=[20180917T190119Z], track=[ilp_track], challenge=[index il-search performance data], car=[['external']]
2018-09-17 19:01:21,46 ActorAddr-(T|:40809)/PID:17993 esrally.actor INFO Capabilities [{'Thespian Generation': (3, 9), 'Thespian Version': '1537210880128', 'ip': '127.0.0.1', 'Thespian ActorSystem Version': 2, 'Python Version': (3, 5, 6, 'final', 0), 'coordinator': True, 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian Watch Supported': True}] match requirements [{'coordinator': True}].
2018-09-17 19:01:21,181 ActorAddr-(T|:40809)/PID:17993 esrally.driver.driver INFO Benchmark is about to start.
2018-09-17 19:01:21,169 ActorAddr-(T|:37553)/PID:17994 esrally.actor INFO Preparing track [ilp_track]
2018-09-17 19:01:21,176 ActorAddr-(T|:37553)/PID:17994 esrally.track.loader INFO Resolved data root directory for document corpus [il-search-performance-data] in track [ilp_track] to ['/home/arane/rally-tracks/ilp_track', '/root/.rally/benchmarks/data/il-search-performance-data'].


Following is the output of /tmp/thespian.log:

2018-09-17 17:27:55.671227 p3123 ERR Socket error sending to ActorAddr-(T|:38743) on <socket.socket fd=10, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.34.33.113', 41456)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|:38743)-pending-ExpiresIn_0:02:52.671847-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7fd627667710>-quit_0:02:52.671733)
2018-09-17 17:27:55.671849 p3134 ERR Socket error sending to ActorAddr-(T|:38743) on <socket.socket fd=10, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.34.33.113', 41466)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|:38743)-pending-ExpiresIn_0:02:52.739763-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7fd62766a9e8>-quit_0:02:52.739662)
2018-09-17 17:27:55.673116 p3124 ERR Socket error sending to ActorAddr-(T|:38743) on <socket.socket fd=10, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('10.34.33.113', 41458)>: [Errno 110] Connection timed out / 110: ************* TransportIntent(ActorAddr-(T|:38743)-pending-ExpiresIn_0:02:52.682988-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7fd627667ac8>-quit_0:02:52.682855)


(Daniel Mitterdorfer) #2

Hi,

can you please tell me:

  • the version of Rally that you're using (esrally --version). If it is not Rally 1.0.1, I'd advise upgrading.
  • the command line parameters (with any sensitive information, like passwords, masked)
  • your timezone offset from UTC, so I can correlate the log output (Thespian's logs are in local time, while Rally's logs are always in UTC)

Thespian is the actor system that Rally uses internally. Its log output seems unrelated to me (the timestamps differ), but it appears that Thespian tried to send a message to the IP 10.34.33.113 and failed (a firewall? the Rally daemon not running on the remote system?). This is odd because with the "benchmark-only" pipeline Thespian should not send anything to a remote system, so I guess this is related to an earlier experiment you did.

Unfortunately, reconfiguring Thespian's logging is a little bit involved, but assuming that you have Rally 1.0.1 installed, you can export the following environment variables to direct Thespian's output to ~/.rally/logs/actor-system-internal.log and set its log level to INFO:

export THESPLOG_FILE="${THESPLOG_FILE:-${HOME}/.rally/logs/actor-system-internal.log}"
# this value is in bytes, the default is 50kB. We increase it to 200kiB.
export THESPLOG_FILE_MAXSIZE=${THESPLOG_FILE_MAXSIZE:-204800}
# adjust the default log level from WARNING
export THESPLOG_THRESHOLD="INFO"

Ensure that there are no Rally processes currently running (e.g. ps -ef | grep rally) and terminate any remaining ones. Can you then please rerun the benchmark and provide both log outputs once more?

Daniel


(Akhil Rane) #3

Here is the information you requested:

  1. esrally version is 1.0.1
  2. here is the command I executed (with IPs changed): esrally --track-path=/home/arane/rally-tracks/ilp_track --target-hosts=10.24.16.102:9200,10.24.16.103:9200,10.24.16.104:9200 --pipeline=benchmark-only --client-options="basic_auth_user:,basic_auth_password:"
  3. timezone offset from UTC: -4 hours

I closed all Rally processes and tried to rerun the benchmark. It's hanging again. Here are the log outputs:

  1. output of the rally.log

2018-09-18 14:06:27,525 ActorAddr-(T|:45722)/PID:3250 esrally.mechanic.telemetry WARNING indexing_total_time is 37333 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-18 14:06:27,526 ActorAddr-(T|:45722)/PID:3250 esrally.mechanic.telemetry WARNING refresh_total_time is 11535 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-18 14:06:27,527 ActorAddr-(T|:45722)/PID:3250 esrally.mechanic.telemetry WARNING flush_total_time is 1146 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-09-18 14:06:27,552 ActorAddr-(T|:36789)/PID:3242 esrally.actor INFO BenchmarkActor received unknown message [<esrally.mechanic.mechanic.BenchmarkStarted object at 0x7f0cd077f438>] (ignoring).
2018-09-18 14:06:27,603 ActorAddr-(T|:38905)/PID:3252 esrally.metrics INFO Creating in-memory metrics store
2018-09-18 14:06:27,605 ActorAddr-(T|:38905)/PID:3252 esrally.metrics INFO Opening metrics store for trial timestamp=[20180918T140626Z], track=[ilp_track], challenge=[index il-search performance data], car=[['external']]
2018-09-18 14:06:27,606 ActorAddr-(T|:38905)/PID:3252 esrally.actor INFO Capabilities [{'Thespian Generation': (3, 9), 'Thespian Version': '1537279586514', 'Python Version': (3, 5, 6, 'final', 0), 'Thespian ActorSystem Name': 'multiprocTCPBase', 'ip': '127.0.0.1', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Convention Address.IPv4': '127.0.0.1:1900', 'coordinator': True}] match requirements [{'coordinator': True}].
2018-09-18 14:06:27,729 ActorAddr-(T|:41427)/PID:3253 esrally.actor INFO Preparing track [ilp_track]
2018-09-18 14:06:27,735 ActorAddr-(T|:41427)/PID:3253 esrally.track.loader INFO Resolved data root directory for document corpus [il-search-performance-data] in track [ilp_track] to ['/home/arane/rally-tracks/ilp_track', '/root/.rally/benchmarks/data/il-search-performance-data'].
2018-09-18 14:06:27,742 ActorAddr-(T|:38905)/PID:3252 esrally.driver.driver INFO Benchmark is about to start.


Output of actor-system-internal.log:

2018-09-18 10:11:32.475667 p3764 Warn Transmit attempt from ActorAddr-(T|:39124) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.476347 p3764 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b0b00>-quit_0:00:00)
2018-09-18 10:11:32.495595 p3765 Warn Transmit attempt from ActorAddr-(T|:43711) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.496313 p3765 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b0eb8>-quit_0:00:00)
2018-09-18 10:11:32.504658 p3766 Warn Transmit attempt from ActorAddr-(T|:34389) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.505447 p3766 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0ab0b8>-quit_0:00:00)
2018-09-18 10:11:32.551288 p3768 Warn Transmit attempt from ActorAddr-(T|:37150) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.552153 p3768 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b4a20>-quit_0:00:00)
2018-09-18 10:11:32.552831 p3769 Warn Transmit attempt from ActorAddr-(T|:43184) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.553495 p3769 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b4dd8>-quit_0:00:00)
2018-09-18 10:11:32.563258 p3767 Warn Transmit attempt from ActorAddr-(T|:43393) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.563974 p3770 Warn Transmit attempt from ActorAddr-(T|:34913) to ActorAddr-(T|:38905) timed out, returning PoisonPacket
2018-09-18 10:11:32.564147 p3767 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#1-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b4668>-quit_0:00:00)
2018-09-18 10:11:32.565061 p3770 I completion error: ************* TransportIntent(ActorAddr-(T|:38905)-=-SENDSTS_FAILED-retry#2-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x7f0ccd0b81d0>-quit_0:00:00)


(Akhil Rane) #4

Hi Daniel,

This issue is resolved. I was running operations in parallel and had configured 1000 clients for that, and my system could not take the load. When I turned it down to 100 clients, it worked perfectly. The reason I could not diagnose the problem was that Rally hung with the following output on the command line.

****** Use this pipeline only if you are aware of the tradeoffs. ******
*************************** Watch your step! ***************************

[INFO] Racing on track [ilp_track], challenge [index il-search performance data] and car ['external'] with version [5.2.1].

[WARNING] indexing_total_time is 37333 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 11535 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 1146 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.

Is there any way we can improve on this?

Akhil


(Daniel Mitterdorfer) #5

Hi,

thanks for the feedback. Given that Rally's load generation is process-based, 1000 clients might indeed be a bit much. Do you really have that many concurrent connections to Elasticsearch in production? Otherwise I wonder what you want to achieve with this benchmark.
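For context, the client count is configured per task in the track's challenge schedule. A minimal sketch of a parallel element with a moderate client count (the operation names are illustrative, not from this track):

```json
{
  "parallel": {
    "tasks": [
      {
        "operation": "index-append",
        "clients": 100
      },
      {
        "operation": "default-search",
        "clients": 20
      }
    ]
  }
}
```

Each client is simulated by its own load generator, so the total across all parallel tasks is what the coordinating machine has to sustain.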

We should probably add a static safeguard to Rally so it will abort early if you attempt to run with that many clients.
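Such a safeguard could be a simple pre-flight check on the requested client count before any processes are spawned. A hypothetical sketch; the threshold, function name, and message are illustrative, not Rally's actual implementation:

```python
# Hypothetical pre-flight safeguard; MAX_CLIENTS is an assumed limit,
# not a value taken from Rally.
MAX_CLIENTS = 512


def check_client_count(total_clients):
    """Fail fast instead of hanging when too many client processes are requested.

    total_clients: the sum of 'clients' across all tasks in the challenge.
    """
    if total_clients > MAX_CLIENTS:
        raise ValueError(
            "Requested %d clients but at most %d are supported; "
            "each client is a separate load generator process."
            % (total_clients, MAX_CLIENTS)
        )


check_client_count(100)  # passes silently for a reasonable client count
```

With a check like this, a run with 1000 clients would abort immediately with an explanatory error instead of silently stalling.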

Daniel


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.