Rally Track Report Analysis (follow up)

Continuing the discussion from Rally Track Report Analysis:

Hi,

we are running esrally 0.9.4 (with the options --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'${ESRALLY_USERNAME}',basic_auth_password:'${ESRALLY_PASSWORD}'") against an Elasticsearch 6.2.3 two-node cluster and still see empty values for the overall metrics:

Comparing baseline
Race timestamp: 2018-03-19 11:59:33
Challenge: append-no-conflicts
Car: external

with contender
Race timestamp: 2018-03-27 07:38:10
Challenge: append-no-conflicts
Car: external



| Metric | Task | Baseline | Contender | Diff | Unit |
| Indexing time | | 0 | 0 | 0 | min |
| Indexing throttle time | | 0 | 0 | 0 | min |
| Merge time | | 0 | 0 | 0 | min |
| Refresh time | | 0 | 0 | 0 | min |
| Flush time | | 0 | 0 | 0 | min |
| Merge throttle time | | 0 | 0 | 0 | min |
| Total Young Gen GC | | 0 | 0 | 0 | s |
| Total Old Gen GC | | 0 | 0 | 0 | s |

Since we are interested in comparing and evaluating these values, could you please help us get valid entries here?

Thanks,
ize0j10

Hi,

as I've indicated in the other thread, we have mainly improved logging and error handling in that area. Can you please share the contents of your Rally log file? I am interested in all lines after:

Gathering indices stats for all primaries on benchmark stop.

This will hopefully help reveal why you only see zeroes in your environment.

Thank you,

Daniel

Hi,

please find the requested contents of the Rally log file:

2018-04-09 09:25:28,379 PID:15 rally.telemetry INFO Gathering indices stats for all primaries on benchmark stop.
2018-04-09 09:25:38,396 PID:15 elasticsearch WARNING GET https://1.2.3.5:9200/_stats/_all?level=shards [status:N/A request:10.016s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/local/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.6/http/client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.6/ssl.py", line 1009, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.6/ssl.py", line 871, in read
return self._sslobj.read(len, buffer)
File "/usr/local/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 147, in perform_request
response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
File "/usr/local/lib/python3.6/site-packages/esrally/client.py", line 47, in urlopen
return self.pool.urlopen(method, url, body=body, retries=retries, headers=headers, **kw)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 333, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='1.2.3.5', port=9200): Read timed out. (read timeout=10)
2018-04-09 09:25:38,397 PID:15 rally.telemetry ERROR Could not retrieve index stats.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/local/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.6/http/client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.6/ssl.py", line 1009, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.6/ssl.py", line 871, in read
return self._sslobj.read(len, buffer)
File "/usr/local/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 147, in perform_request
response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
File "/usr/local/lib/python3.6/site-packages/esrally/client.py", line 47, in urlopen
return self.pool.urlopen(method, url, body=body, retries=retries, headers=headers, **kw)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 333, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='1.2.3.5', port=9200): Read timed out. (read timeout=10)

[I had to cut off to respect the limit of a post]

Please note that http_proxy and HTTP_PROXY are set in the environment so that Rally respects the proxy settings when downloading benchmark data from the internet. As stated in comments in the Rally source, the proxy settings are not used for connecting to the Elasticsearch node(s).

The log file contains timeout errors while fetching Elasticsearch stats:

WARNING GET https://1.2.3.4:9200/_stats/_all?level=shards [status:N/A request:10.016s]

But the target is available:

curl -k --noproxy "*" https://1.2.3.4:9200/_stats/_all?level=shards > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 392k 100 392k 0 0 1192k 0 --:--:-- --:--:-- --:--:-- 1195k

Any idea?

Hi,

that output is helpful indeed and I think we're getting (much) closer now. It looks as if Rally cannot connect to the cluster for some reason and the connection attempt just times out without any response.

Do you also get such warnings for regular benchmark operations? (The error rate in the summary report should always be zero; otherwise some requests failed.) Can you rerun with --on-error="abort" to ensure that you did not get any request errors?

Can you please also share the full command line that you have used to invoke Rally (please do not forget to mask your password before if you use authentication)?

In addition, you might also try to unset all HTTP proxy related env variables and use --offline to see whether the problem is gone. As all the relevant data should be cached by now on your local machine, the benchmark should run just fine.
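For reference, the standard library can report which proxy settings a Python process (Rally included) would pick up from the environment. The following is a minimal, self-contained sketch; the proxy address is hypothetical and only serves the demonstration:

```python
import os
import urllib.request

# Start from a clean slate: drop every proxy-related variable
# (http_proxy, HTTPS_PROXY, no_proxy, ...) so the demo is deterministic.
for var in list(os.environ):
    if var.lower().endswith("_proxy"):
        del os.environ[var]

# Simulate the benchmark environment (hypothetical proxy address).
os.environ["http_proxy"] = "http://proxy.example.com:3128"

# getproxies_environment() reports the proxies a Python process picks
# up from environment variables -- what Rally would inherit.
print(urllib.request.getproxies_environment())
# {'http': 'http://proxy.example.com:3128'}

# Unset the variable before a run that must bypass the proxy entirely.
del os.environ["http_proxy"]
print(urllib.request.getproxies_environment())
# {}
```

Clearing the variables in the shell that launches Rally has the same effect, since child processes inherit the environment.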

Thanks

Daniel

Hi Daniel,

to answer your questions:

The error rate is still zero (applied --on-error="abort").
| All | error rate | scroll | 0 | % |

The command line call is:
esrally --on-error="abort" \
  --offline \
  --report-file=/root/.rally/report.md \
  --track="${ESRALLY_TRACK}" \
  --user-tag="${ESRALLY_USER_TAGS}" \
  --target-hosts="${ESRALLY_TARGET_HOSTS}" \
  --pipeline=benchmark-only \
  --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'${ESRALLY_USERNAME}',basic_auth_password:'${ESRALLY_PASSWORD}'"

We still get:
2018-04-12 06:34:53,301 PID:14 rally.racecontrol INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:40899)] (ignoring).
2018-04-12 06:34:53,316 PID:21 rally.mechanic INFO Flushing system metrics store on host [[{'host': '1.2.3.4', 'port': 9200}, {'host': '1.2.3.5', 'port': 9200}]].
2018-04-12 06:34:53,359 PID:14 rally.racecontrol INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:40899)] (ignoring).
2018-04-12 06:35:03,318 PID:15 elasticsearch WARNING GET https://1.2.3.4:9200/_nodes/stats/_all [status:N/A request:10.017s]

But we still have 0 (zero) stats.

curl returns a valid json result (cut out for readability):
curl -k https://1.2.3.4:9200/_nodes/stats/_all > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17187  100 17187    0     0   341k      0 --:--:-- --:--:-- --:--:--  349k

Any idea?

Honestly, I have to admit that I am starting to run out of ideas: while the actual benchmark is running, Rally seems to be able to connect fine to the cluster, but then it fails to get a response from the node stats API within the timeout (10 seconds). In both cases we use the very same Elasticsearch client configuration.

I will prepare a minimal Python script on Monday that calls the node stats API via the Python client. Maybe that will reveal something.

Hi,

can you please:

  1. Store the following snippet as index_stats.py.
  2. Replace host, port, user and password with the appropriate values.
  3. Issue python3 index_stats.py after you have started the benchmark. After you're done you can terminate it with CTRL+C.
  4. Tell me about the results.

The script issues an index stats call every two seconds and prints "SUCCESS" if it works and the error trace if it doesn't. This is the same call that Rally issues at the end of the benchmark.

import elasticsearch
import urllib3
import time
import datetime


def main():
    # TODO: Provide parameters for your environment here
    host = "localhost"
    port = 9200
    user = "rally"
    password = "rally-password"
    # request timeout of the Elasticsearch client (default: 10 seconds)
    timeout = 10

    # avoid unverified HTTPS request warning (certificates are not validated)
    urllib3.disable_warnings()

    es = elasticsearch.Elasticsearch(hosts=[{"host": host, "port": port}],
                                     timeout=timeout,
                                     use_ssl=True,
                                     verify_certs=False,
                                     http_auth=(user, password))
    while True:
        now = datetime.datetime.now(datetime.timezone.utc)
        msg = "[{:%Y-%m-%dT%H:%M:%SZ}] Retrieving index stats... ".format(now)
        print(msg, flush=True, end="")
        try:
            es.indices.stats(metric="_all", level="shards")
            print("SUCCESS", flush=True)
        except (elasticsearch.ElasticsearchException,
                urllib3.exceptions.HTTPError) as e:
            print("ERROR", flush=True)
            print(e)
        time.sleep(2)


if __name__ == '__main__':
    main()

Daniel

Hi Daniel,

the results are:

  1. Zero stats for "All":
|   All |                   Indexing time |              |       0 |    min |
|   All |          Indexing throttle time |              |       0 |    min |
|   All |                      Merge time |              |       0 |    min |
|   All |                    Refresh time |              |       0 |    min |
|   All |                      Flush time |              |       0 |    min |
|   All |             Merge throttle time |              |       0 |    min |
|   All |              Total Young Gen GC |              |       0 |      s |
|   All |                Total Old Gen GC |              |       0 |      s |

  2. Timeouts retrieving node stats:
2018-04-17 12:20:27,194 PID:15 elasticsearch WARNING GET https://1.2.3.4:9200/_nodes/stats/_all [status:N/A request:10.037s]
  3. index_stats.py produced 2k lines of "SUCCESS". No other output.
>_ index_stats.py >> indices_stats.log
>_ tail -n 10 indices_stats.log
[2018-04-17T12:20:21Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:23Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:25Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:28Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:30Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:32Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:34Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:36Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:38Z] Retrieving index stats... SUCCESS
[2018-04-17T12:20:40Z] Retrieving index stats... SUCCESS
>_ cat indices_stats.log |grep -v "Retrieving index stats\.\.\. SUCCESS"
>_

Anything else I can try?

Maybe you can reproduce this? I created a Dockerfile:

FROM python:3

ARG VERSION_ESRALLY=0.9.4

# esrally requires JAVA
RUN echo "deb http://http.debian.net/debian jessie-backports main" >> /etc/apt/sources.list \
  && apt-get update \
  && apt-get install --yes --no-install-recommends --target-release jessie-backports openjdk-8-jre-headless \
  && /usr/sbin/update-java-alternatives -s java-1.8.0-openjdk-amd64

RUN pip install esrally=="${VERSION_ESRALLY}" \
  && esrally configure

ENTRYPOINT [ "esrally" ]
CMD [ "--help" ]

And I run it with a bash script:

#!/usr/bin/env bash
#
# Created to ease running an esrally benchmark (race).
#
###########################################################
# Run esrally to benchmark existing elasticsearch cluster.
# Globals:
#   HTTP_PROXY
#   HTTPS_PROXY
# Arguments:
#   None
# Returns:
#   None
##########################################################

readonly CONTAINER_NAME='esrally'
# esrally must not run against production since it can destroy your cluster.
readonly ESRALLY_TARGET_HOSTS='1.2.3.4:9200,1.2.3.5:9200'
readonly IMAGE="mynamespace/elastic/esrally"

# Print tracks to have a selection.
docker pull "${IMAGE}"
docker run \
  --env http_proxy="${HTTP_PROXY}" \
  --env https_proxy="${HTTPS_PROXY}" \
  --rm \
  --tty \
  --volume "${PWD}":/root/.rally \
  "${IMAGE}" list tracks

# Read credentials and options to run esrally.
echo "Insert superuser credentials to run API calls against elasticsearch."
read -p 'TRACK:' ESRALLY_TRACK
read -p 'TAGS (key:value, comma separated):' ESRALLY_USER_TAGS
read -p 'USERNAME:' ESRALLY_USERNAME
read -sp 'PASSWORD:' ESRALLY_PASSWORD

# Run the benchmark to generate a report.
echo "Kicking off detached container to run benchmark. Writing output to: ${PWD}."
docker rm --force --volumes "${CONTAINER_NAME}"
docker run \
  --detach \
  --env http_proxy="${HTTP_PROXY}" \
  --env https_proxy="${HTTPS_PROXY}" \
  --interactive \
  --name "${CONTAINER_NAME}" \
  --tty \
  --volume "${PWD}":/root/.rally \
  "${IMAGE}" \
    --on-error="abort" \
    --report-file=/root/.rally/report.md \
    --track="${ESRALLY_TRACK}" \
    --user-tag="${ESRALLY_USER_TAGS}" \
    --target-hosts="${ESRALLY_TARGET_HOSTS}" \
    --pipeline=benchmark-only \
    --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'${ESRALLY_USERNAME}',basic_auth_password:'${ESRALLY_PASSWORD}'"
echo "Done."

CU
ize0j10

Hi,

thanks for your reproduction scenario. I am sorry that it took so long to get back to you, but I still could not reproduce the behaviour.

However, I have another idea about what might cause this. The code that retrieves the data makes one call at the beginning of the benchmark and one at the end, so maybe there is a problem with the internal connection pool of the Elasticsearch client.

Does your benchmark take a very long time (more than a day)? Also, do you see any other errors towards the end of a benchmark (e.g. "Could not retrieve GC times.")? This could explain the symptoms as well. I want to note, though, that we also run benchmarks that take more than a day to complete and we do not experience this problem, so this might be the wrong route. A packet capture (e.g. with Wireshark) could maybe reveal what is going on (or rather not going on) at the network level.

I doubt that this will help but you can also try to increase the read timeout e.g. to 60 seconds by adding timeout:60 to the --client-options parameter.

Daniel

Hi Daniel,

to answer your questions I triggered a new benchmark using the known command line plus the added timeout option.

Unfortunately, with the latest version this benchmark does not finish at all.

The last job is stuck at:
Running scroll [100% done]

The log file is full of:
2018-05-08 10:23:26,500 PID:75 rally.mechanic INFO Flushing system metrics store on host [[{'host': '1.2.3.4', 'port': 9200, 'use_ssl': True}, {'host': '1.2.3.5', 'port': 9200, 'use_ssl': True}]].

The Errors so far are:

[root@machine]# grep ERROR rally_out_20180504T050820Z.log
2018-05-04 06:24:50,643 PID:15 rally.telemetry ERROR Could not retrieve GC times.
2018-05-04 06:25:50,656 PID:15 rally.telemetry ERROR Could not retrieve index stats.
2018-05-04 06:25:50,659 PID:15 rally.actor ERROR Error in mechanic

Until now, the benchmark always completed within several hours.

Cu
ize0j10

Hey @ize0j10,

I have followed this thread and am also rather baffled about what's going on. I will recap my understanding below (please correct any errors) and then make some suggestions.

My understanding is that you are using the latest released Rally (currently 0.10.1) against a two-node 6.3 Elasticsearch cluster.

Are you using a custom track or one of the standard ones?

IIUC your race takes some time to complete, and while the various operations complete successfully, you are experiencing problems (read timeouts) towards the end of the benchmark, when Rally collects index stats (and GC times).

If the track is a lengthy one, it would make sense to limit the amount of operations and see if the problem happens again.
If you are using any of the standard Rally tracks, they support a number of options that can be passed using --track-params; for example geonames supports ingest_percentage and bulk_size/bulk_indexing_clients. Significantly reducing ingest_percentage (say to 1) will index far fewer documents and reduce the total running time of the benchmark.

If you are using a custom track (specified with --track-path) you could modify it to run fewer operations and index less data (again with the ingest_percentage property of the bulk operation).

Should the benchmark finish successfully we will have an indication that there is some strange network problem.

The Elasticsearch Python client (which Rally uses) works with persistent connections, and there is a chance that something (a firewall?) is terminating long-lived connections. The client, however, supports retry_on_timeout; @danielmitterdorfer, what do you think, would passing retry_on_timeout=True in --client-options be worth trying? This parameter is picked up by the client_factory used by telemetry devices such as IndexStats and GcTimesSummary.
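To illustrate what retry_on_timeout buys us: the sketch below is not elasticsearch-py's actual implementation (that logic lives in its transport layer), but it shows the principle of retrying a request once after a read timeout, which gives the pool a chance to hand out a fresh connection. The helper names are hypothetical:

```python
import socket

def perform_request(send, max_retries=1):
    """Call send(); on a read timeout, retry up to max_retries times.

    Sketch of the effect of retry_on_timeout=True: a request that fails
    because an idle connection was silently dropped gets another attempt,
    typically on a freshly established connection.
    """
    attempt = 0
    while True:
        try:
            return send()
        except socket.timeout:
            attempt += 1
            if attempt > max_retries:
                raise

# Simulated endpoint: the first call times out (stale connection),
# the second one succeeds (fresh connection).
calls = {"n": 0}

def flaky_send():
    calls["n"] += 1
    if calls["n"] == 1:
        raise socket.timeout("The read operation timed out")
    return {"_all": {"primaries": {}}}

print(perform_request(flaky_send))  # {'_all': {'primaries': {}}}
print(calls["n"])                   # 2 (one timeout, one retry)
```

Without the retry, the first timeout would bubble up exactly like the "Could not retrieve index stats." errors in the logs above.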

Regards,
Dimitris

Hi dliappis,

Yes, we are using the standard track "http_logs", triggered via:

/usr/local/bin/python /usr/local/bin/esrally --on-error=abort --report-file=/root/.rally/report.md --track=http_logs --user-tag=elastic:6.2.4 --target-hosts=https://1.2.3.4:9200,https://1.2.3.5:9200 --pipeline=benchmark-only --client-options=timeout:60,use_ssl:true,verify_certs:false,basic_auth_user:'myuser',basic_auth_password:'mypassword'

Errors during the last run are:

[root@machine logs]# grep ERROR rally_out_20180508T103458Z.log
2018-05-08 12:08:36,285 PID:16 rally.telemetry ERROR Could not retrieve GC times.
2018-05-08 12:09:36,350 PID:16 rally.telemetry ERROR Could not retrieve index stats.
2018-05-08 12:09:36,355 PID:16 rally.actor ERROR Error in mechanic

We are running esrally via a Dockerfile: Python 3, Java 1.8, Debian-based. (See the earlier answers.)

Up to 0.10.0 the track completes but lacks the general metrics listed at the top of the report. This is why I started this thread.

I just kicked off a benchmark using 0.11.0 to report the results and see if the issues still exist.

CU
ize0j10

Hi,

esrally 0.11.0 completes. The observations are:

  • runtime: 2 hours 15 min
  • log:

ERRORS:

2018-05-14 10:13:06,720 PID:15 rally.telemetry ERROR Could not retrieve GC times.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/local/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.6/ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.6/ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "/usr/local/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
2018-05-14 10:14:06,788 PID:15 rally.telemetry ERROR Could not retrieve index stats.

WARNINGS

2018-05-14 08:08:19,739 PID:14 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:20,649 PID:75 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:20,843 PID:15 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,167 PID:15 rally.telemetry WARNING merges_total_time is 13492165 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-05-14 08:08:21,167 PID:15 rally.telemetry WARNING merges_total_throttled_time is 3486818 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-05-14 08:08:21,167 PID:15 rally.telemetry WARNING indexing_total_time is 8821608 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-05-14 08:08:21,167 PID:15 rally.telemetry WARNING refresh_total_time is 7000127 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-05-14 08:08:21,167 PID:15 rally.telemetry WARNING flush_total_time is 17180 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
2018-05-14 08:08:21,193 PID:80 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,201 PID:78 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,203 PID:81 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,203 PID:79 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,219 PID:82 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,222 PID:84 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,230 PID:83 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:08:21,235 PID:85 rally.client WARNING User has enabled SSL but disabled certificate verification. This is dangerous but may be ok for a benchmark. Disabling urllib warnings now to avoid a logging storm. See https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings for details.
2018-05-14 08:09:47,747 PID:81 elasticsearch WARNING POST https://1.2.3.5:9200/_bulk [status:N/A request:60.034s]
2018-05-14 08:09:47,755 PID:81 elasticsearch WARNING Connection <ConfigurableHttpConnection: https://1.2.3.5:9200> has failed for 1 times in a row, putting on 60 second timeout.
2018-05-14 10:13:06,712 PID:15 elasticsearch WARNING GET https://1.2.3.4:9200/_nodes/stats/_all [status:N/A request:60.063s]
2018-05-14 10:14:06,787 PID:15 elasticsearch WARNING GET https://1.2.3.4:9200/_stats/_all?level=shards [status:N/A request:60.061s]
  • (Please note, we already verified that during the benchmark the APIs at https://1.2.3.4:9200/_nodes/stats/_all and https://1.2.3.5:9200/_nodes/stats/_all are available and accessible.)
  • In the current report the metrics "Indexing time", "Indexing throttle time", "Merge time", "Refresh time", "Flush time" and "Merge throttle time" are not showing up (did you remove them from the report?).
  • The values for "Total Young Gen GC" and "Total Old Gen GC" have zero values - error rate is zero as well.
  • All other metrics have non zero values.

CU
ize0j10

I still wonder whether @dliappis's hunch about a network problem (e.g. the network connection being terminated) might guide us in the right direction. Rally issues just one request at the start and one request at the end of the benchmark to retrieve the metrics in question. If there are two hours in between, some component might indeed terminate the connection.
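The suspected failure mode (a firewall silently dropping a connection that sat idle in the pool for two hours) can be sketched with a toy connection pool that re-validates connections before reuse. These classes are purely illustrative, not Rally or elasticsearch-py code:

```python
import time

class PooledConnection:
    """Toy stand-in for a persistent HTTP connection."""

    def __init__(self):
        self.alive = True

    def ping(self):
        return self.alive

class IdleCheckingPool:
    """Hand out a fresh connection if the pooled one was idle too long
    or no longer responds to a ping (hypothetical design sketch)."""

    def __init__(self, max_idle=60.0, clock=time.monotonic):
        self.max_idle = max_idle
        self.clock = clock
        self.conn = PooledConnection()
        self.last_used = self.clock()

    def get(self):
        idle = self.clock() - self.last_used
        if idle > self.max_idle or not self.conn.ping():
            self.conn = PooledConnection()  # reconnect instead of reusing
        self.last_used = self.clock()
        return self.conn

# Simulate a two-hour benchmark during which a firewall kills the idle
# connection; a fake clock keeps the example deterministic.
fake_now = [0.0]
pool = IdleCheckingPool(max_idle=60.0, clock=lambda: fake_now[0])
first = pool.get()
first.alive = False        # firewall silently drops the idle connection
fake_now[0] += 2 * 3600    # two hours pass
second = pool.get()
print(second is first, second.alive)  # False True
```

A pool without the idle/ping check would return the dead connection and the next request would hang until the read timeout, which matches the symptoms in this thread.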

Can you please try to reduce the total benchmark duration to just a few minutes? You can run with --challenge="append-no-conflicts-index-only" --track-params="ingest_percentage:10". This will run an indexing workload (without queries) and only ingest 10% of the corpus.

Daniel

Hi Daniel,

if there really is zero traffic on the established connections, then it is indeed the case that our firewall disconnects them.

I think there should be a health check and a reconnect when Rally picks up a connection from the pool.

However applying --challenge="append-no-conflicts-index-only" --track-params="ingest_percentage:10" provides the metrics in question.

Is there a chance you could implement behaviour that reconnects once a network connection is no longer established?

Thanks,
ize0j10

Hey @ize0j10,

Can you try passing retry_on_timeout=True in your --client-options and re-run? This was described in the last paragraph of my earlier comment and might trigger the Elasticsearch Python client to re-establish a connection if there's a timeout.

Hi,

passing retry_on_timeout:true to the --client-options re-established the connection and the metrics in question are no longer zero!

For me this issue is solved, but for others it might not be obvious that the retry option is not the default. I hereby vote to make it the default.

Thanks for your help!

CU
ize0j10


@ize0j10 I am glad this sorted you out. I opened https://github.com/elastic/rally/issues/502 to use this option by default for telemetry devices.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.