Hey @ize0j10,
I followed this thread and am also rather baffled about what's going on. I will recap my understanding below as a summary (please correct errors) and proceed with some suggestions.
My understanding is that you are using the latest released Rally (currently `0.10.1`) against a two-node Elasticsearch 6.3 cluster.
Are you using a custom track or one of the standard ones?
IIUC your race takes some time to complete and, while the various operations complete successfully, you are experiencing problems (read timeouts) towards the end of the benchmark, when Rally collects index stats (and GC times).
If the track is a lengthy one, it would make sense to limit the number of operations and see if the problem happens again.
If you are using any of the standard Rally tracks, they support a number of options that can be passed using `--track-params`; for example, geonames supports `ingest_percentage` as well as `bulk_size`/`bulk_indexing_clients`. Significantly reducing `ingest_percentage` (say, to `1`) will index fewer documents and reduce the total running time of the benchmark.
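As a concrete sketch (assuming Rally is installed, the cluster is reachable at `localhost:9200`, and you use the `benchmark-only` pipeline; adjust hosts and pipeline for your setup):

```shell
# Run the geonames track but ingest only ~1% of the corpus,
# which should shorten the benchmark considerably.
esrally --track=geonames \
        --target-hosts=localhost:9200 \
        --pipeline=benchmark-only \
        --track-params="ingest_percentage:1"
```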
If you are using a custom track (specified with `--track-path`) you could modify it to just run fewer operations and index less data (again with the `ingest-percentage` property of the bulk operation).
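As a sketch, the bulk task in your custom track's schedule could look something like the following (the `ingest-percentage` and `bulk-size` property names are from the Rally track reference; the surrounding values are hypothetical placeholders):

```json
{
  "operation": {
    "operation-type": "bulk",
    "bulk-size": 5000,
    "ingest-percentage": 1
  },
  "clients": 8
}
```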
Should the benchmark finish successfully, we will have an indication that there is some strange network problem.
The Elasticsearch Python client (which Rally uses) uses persistent connections, and there is a chance that something (a firewall?) is terminating long-lived connections. The client, however, supports `retry_on_timeout`; @danielmitterdorfer what do you think, would passing `retry_on_timeout=True` in `--client-options` be worth trying? This parameter will be used by the `client_factory` used by telemetry devices such as `IndexStats` and `GcTimesSummary`.
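For example, the option could be passed on the command line like this (I have not verified how it interacts with the telemetry devices' `client_factory`, so treat this as a sketch; merge it with whatever client options you already pass):

```shell
# Ask the Elasticsearch Python client to retry on a read timeout
# instead of failing the telemetry call outright.
esrally --track=geonames \
        --target-hosts=localhost:9200 \
        --pipeline=benchmark-only \
        --client-options="retry_on_timeout:true"
```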
Regards,
Dimitris