Error when running Rally

I am trying to run Rally to benchmark two clusters separately.
The first cluster runs ES 2.3.5 with searchguard, on the same server that also runs esrally. Everything is OK there.
However, when I run esrally to benchmark the second, remote cluster, which runs 2.3.2 without searchguard, the following error occurs.
My command is as follows:

esrally --track=geonames --offline --pipeline=benchmark-only --target-hosts=192.168.1.1:9200

The error is as follows:

2016-10-18 10:10:46,498 rally.driver ERROR Cluster did not reach status [green]. Last reached status: [green]
2016-10-18 10:10:46,501 rally.telemetry INFO Benchmark stop
2016-10-18 10:10:46,501 rally.telemetry INFO Gathering nodes stats
2016-10-18 10:10:52,452 rally.telemetry INFO Gathering indices stats
2016-10-18 10:10:55,195 rally.telemetry WARNING Could not determine metric [segments_points_memory_in_bytes] at path [segments,points_memory_in_bytes].
2016-10-18 10:10:55,232 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/esrally-0.4.1-py3.5.egg/esrally/racecontrol.py", line 364, in run
    pipeline()
  File "/usr/local/lib/python3.5/site-packages/esrally-0.4.1-py3.5.egg/esrally/racecontrol.py", line 60, in __call__
    step()
  File "/usr/local/lib/python3.5/site-packages/esrally-0.4.1-py3.5.egg/esrally/racecontrol.py", line 29, in __call__
    self.command(self.ctx)
  File "/usr/local/lib/python3.5/site-packages/esrally-0.4.1-py3.5.egg/esrally/racecontrol.py", line 188, in benchmark_external
    raise exceptions.RallyError("Driver has returned no metrics but instead [%s]. Terminating race without result." % str(completed))
esrally.exceptions.RallyError: Driver has returned no metrics but instead [Poison<<esrally.driver.StartBenchmark object at 0x7f4d45a93dd8>>]. Terminating race without result.

Is something wrong with my command?

Thanks

Hi @leozkl1,

can you please update to the latest version of Rally (which is 0.4.3)? Error reporting has improved a lot in the latest version for these cases. I don't know what you mean by "searchguard". Can you please elaborate?

Daniel

Hi Daniel:

Thanks for your reply. My esrally version is already 0.4.3; however, I have resolved this error by closing the relocating shards.
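
For anyone hitting the same error: the log reports [green] both as the expected and as the last reached status, which would be consistent with a health check that also waits for relocating shards to settle. Below is a minimal sketch of such a readiness check. It assumes the response shape of Elasticsearch's `_cluster/health` API (fields `status` and `relocating_shards`); the function name `cluster_ready` and the sample responses are hypothetical, and this is not Rally's actual code.

```python
def cluster_ready(health, expected_status="green"):
    """Return True only if the status matches and no shards are relocating.

    `health` is assumed to be the parsed JSON body of GET /_cluster/health.
    """
    status_ok = health.get("status") == expected_status
    no_relocation = health.get("relocating_shards", 0) == 0
    return status_ok and no_relocation


# Hypothetical sample responses, for illustration only.
still_relocating = {"status": "green", "relocating_shards": 2}
settled = {"status": "green", "relocating_shards": 0}

print(cluster_ready(still_relocating))
print(cluster_ready(settled))
```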
I also have some questions about the report:

                  Indexing time                           58.0136        49.61       -8.40363     min
                     Merge time                           47.5014      20.4587      -27.04270     min
                   Refresh time                           6.76597      6.11877       -0.64720     min
                     Flush time                          0.844633     0.708417       -0.13622     min
            Merge throttle time                           2.81232      2.52082       -0.29150     min
             Total Young Gen GC                           176.882       1432.4    +1255.52100       s
               Total Old Gen GC                            23.105       28.379       +5.27400       s
                  Segment count                               156          179      +23.00000
                 Min Throughput          index-append     41747.8      21206.4   -20541.46094  docs/s
              Median Throughput          index-append     56149.7      56749.3     +599.61914  docs/s
                 Max Throughput          index-append     59739.1      62857.6    +3118.48047  docs/s
      90.0th percentile latency          index-append     783.063      750.836      -32.22756      ms
      99.0th percentile latency          index-append     930.486      827.449     -103.03735      ms
     100.0th percentile latency          index-append     1018.55      873.845     -144.70099      ms
 90.0th percentile service time          index-append     783.063      750.836      -32.22756      ms
 99.0th percentile service time          index-append     930.486      827.449     -103.03735      ms
100.0th percentile service time          index-append     1018.55      873.845     -144.70099      ms

  1. What is the definition of the 100.0th percentile latency? The value for the "index-append" operation in the report is over 1 s.
  2. The indexing time is about 58 minutes, but the whole test took only 30 minutes. Why does this happen?
  3. I cannot find a disk I/O key in either the report or Elasticsearch; this differs from the reference docs.
  4. Are detailed definitions of the metrics and operations given in the reference docs?

By the way, searchguard is an open-source security plugin for Elasticsearch, similar to Shield.

Thanks!

Hi @leozkl1,

the 100.0th percentile is the maximum value that has been encountered. So this means that the slowest bulk index operation took 1018.55 ms, or about 1.019 s.
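
To illustrate why the 100th percentile equals the maximum, here is a minimal sketch using the nearest-rank definition of a percentile. This is an assumption for illustration; Rally may compute percentiles with a different method. The sample latencies are made up, except that the largest one is taken from the report above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms; only the maximum comes from the report.
latencies_ms = [650.0, 700.2, 712.5, 730.1, 750.8, 783.1, 800.4, 827.4, 930.5, 1018.55]

print(percentile(latencies_ms, 100) == max(latencies_ms))
```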

The indexing time is not wall clock time but total time spent indexing. For example, if you have 4 indexing threads that each index for 10 minutes (in parallel), 10 minutes have elapsed but the 4 indexing threads have spent 4 * 10 minutes = 40 minutes total. Makes sense?
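
The arithmetic from that example, as a small sketch. The function names are hypothetical, and it assumes all threads start at the same time, so the elapsed time is that of the longest-running thread.

```python
def total_indexing_minutes(per_thread_minutes):
    """Cumulative time spent indexing, summed over all threads."""
    return sum(per_thread_minutes)


def wall_clock_minutes(per_thread_minutes):
    """Elapsed time, assuming all threads start together and run in parallel."""
    return max(per_thread_minutes)


# Four indexing threads that each index for 10 minutes in parallel.
threads = [10, 10, 10, 10]

print(total_indexing_minutes(threads), "minutes spent indexing in total")
print(wall_clock_minutes(threads), "minutes elapsed")
```

This is why a reported indexing time of about 58 minutes is compatible with a benchmark that ran for about 30 minutes.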

The metrics key is called disk_io_write_bytes (see docs). You can query the last 10 values with the following query:

GET /rally-*/metrics/_search
{
   "query": {
      "term": {
         "name": {
            "value": "disk_io_write_bytes"
         }
      }
   },
   "size": 10,
   "sort": [
      {
         "trial-timestamp": {
            "order": "desc"
         }
      }
   ]
}
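
If you prefer to script this, the same request body can be built in Python. This is only a sketch: the helper name `build_metric_query` is hypothetical, and the body simply mirrors the query above.

```python
import json


def build_metric_query(metric_name, size=10):
    """Build the search body for the most recent samples of one metric."""
    return {
        "query": {"term": {"name": {"value": metric_name}}},
        "size": size,
        "sort": [{"trial-timestamp": {"order": "desc"}}],
    }


body = build_metric_query("disk_io_write_bytes")
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent to `/rally-*/metrics/_search` on your metrics store with any HTTP client.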

On my machine a hit looks like this for example:

{
            "_index": "rally-2016",
            "_type": "metrics",
            "_id": "AVfXlVqoHtkxNiBSdqV6",
            "_score": null,
            "_source": {
               "@timestamp": 1476790599815,
               "environment": "local",
               "name": "disk_io_write_bytes",
               "relative-time": 3272043053,
               "trial-timestamp": "20161018T104207Z",
               "meta": {
                  "host_name": "taz",
                  "os_name": "Linux",
                  "node_name": "rally-node0",
                  "jvm_version": "1.8.0_102",
                  "cpu_logical_cores": 8,
                  "source_revision": "13e62e1",
                  "cpu_physical_cores": 4,
                  "cpu_model": "Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz",
                  "jvm_vendor": "Oracle Corporation",
                  "os_version": "4.7.4-1-ARCH"
               },
               "value": 347668783104,
               "sample-type": "normal",
               "challenge": "append-no-conflicts",
               "car": "defaults",
               "track": "nyc_taxis",
               "unit": "byte"
            },
            "sort": [
               1476787327000
            ]
         }
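
A minimal sketch of how such a hit could be summarized in Python. The helper name `describe_hit` is hypothetical, the hit below is trimmed to the fields that are used, and the conversion assumes decimal gigabytes (1 GB = 10^9 bytes).

```python
def describe_hit(hit):
    """Summarize one metrics hit as a short human-readable string."""
    source = hit["_source"]
    if source["unit"] != "byte":
        raise ValueError("expected a metric measured in bytes")
    gigabytes = source["value"] / 1e9  # decimal GB
    return "{}: {:.2f} GB (track={}, challenge={})".format(
        source["name"], gigabytes, source["track"], source["challenge"]
    )


# The hit from above, reduced to the fields used here.
hit = {
    "_source": {
        "name": "disk_io_write_bytes",
        "value": 347668783104,
        "unit": "byte",
        "track": "nyc_taxis",
        "challenge": "append-no-conflicts",
    }
}

print(describe_hit(hit))
```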

Metrics are defined in the reference docs. For operations you should look directly into the track repository (look in track.json of the track you're running) but you can choose any name for an operation. Are you missing something?

I didn't know about searchguard; thanks for the info. I guess your problem then was that Rally did not authenticate at the cluster. You can use --client-options to pass username and password (see the examples in the docs; it's basically: --client-options="basic_auth_user:'user',basic_auth_password:'password',timeout:60000,request_timeout:60000" assuming that you don't use SSL; otherwise, please look in the docs).
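
If you generate the command from a script, the option string can be assembled like this. This is a sketch only: the helper `format_client_options` is hypothetical, it just reproduces the format shown above (string values in single quotes, everything else as-is), and 'user' and 'password' are placeholders.

```python
def format_client_options(options):
    """Render a dict as a comma-separated key:value list, quoting string values."""
    parts = []
    for key, value in options.items():
        if isinstance(value, str):
            parts.append("{}:'{}'".format(key, value))
        else:
            parts.append("{}:{}".format(key, value))
    return ",".join(parts)


client_options = format_client_options({
    "basic_auth_user": "user",          # placeholder
    "basic_auth_password": "password",  # placeholder
    "timeout": 60000,
    "request_timeout": 60000,
})

# Command line from the original post, with the client options appended.
command = [
    "esrally",
    "--track=geonames",
    "--offline",
    "--pipeline=benchmark-only",
    "--target-hosts=192.168.1.1:9200",
    "--client-options=" + client_options,
]
print(client_options)
```

Passing `command` as a list (for example to `subprocess.run`) avoids shell quoting problems with the single quotes in the option string.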

Daniel