Increase data size in Rally existing tracks


#1

Hi Daniel,

I am using Rally's existing tracks for benchmarking. I noticed that nyc_taxis is the largest track, with 4.5 GB of compressed and 74.3 GB of uncompressed documents. I want to test with a larger data volume. Does Rally provide an option to duplicate or triplicate the data in existing tracks?


(Daniel Mitterdorfer) #2

Hi @Alp1,

You can apply a trick so that Rally indexes the data into multiple indices, but you need to create your own track for that. I suggest using the latest version (currently 0.9.1), because Rally 0.9.0 introduced the concept of "document corpora", which lets you reuse document corpora from other tracks. Here is a complete example that bulk-indexes the nyc_taxis document corpus ten times (note the index_count variable at the top):

{% set index_count = 10 %}
{
  "version": 2,
  "description": "Taxi rides in New York in 2015",
  "indices": [
  {% set comma = joiner() %}
  {% for item in range(index_count) %}
  {{ comma() }}
    {
      "name": "nyc_taxis-{{item}}",
      "body": "index.json",
      "types": [ "type" ],
      "auto-managed": false
    }
  {% endfor %}
  ],
  "corpora": [
    {
      "name": "nyc_taxis",
      "base-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/nyc_taxis",
      "documents": [
      {% set comma = joiner() %}
      {% for item in range(index_count) %}
      {{ comma() }}
        {
          "target-index": "nyc_taxis-{{item}}",
          "target-type": "type",
          "source-file": "documents.json.bz2",
          "document-count": 165346692,
          "compressed-bytes": 4812721501,
          "uncompressed-bytes": 79802445255
        }
      {% endfor %}
      ]
    }
  ],
  "challenge": {
      "name": "bulk-index",
      "schedule": [
        {
          "operation": "delete-index"
        },
        {
          "operation": {
            "operation-type": "create-index",
            "settings": {
              "index.number_of_replicas": 0
            }
          }
        },
        {
          "name": "check-cluster-health",
          "operation": {
            "operation-type": "cluster-health",
            "index": "nyc_taxis-*",
            "request-params": {
              "wait_for_status": "{{cluster_health | default('green')}}",
              "wait_for_no_relocating_shards": "true"
            }
          }
        },
        {
          "operation": {
            "name": "index-append",
            "operation-type": "bulk",
            "bulk-size": {{bulk_size | default(10000)}}
          },
          "clients": 8,
          "warmup-time-period": 0
        },
        {
          "operation": "refresh",
          "clients": 1
        },
        {
          "operation": "force-merge",
          "clients": 1
        }
      ]
    }
}
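
To make it easier to see what the two Jinja2 loops in the track above expand to, here is a minimal Python sketch that uses only the standard library. It is not how Rally renders tracks (Rally processes the Jinja2 template itself); the helper name expand_track_lists is hypothetical, and all figures are copied from the track definition above. It also sums up how much data an index_count of 10 represents.

```python
import json

# Figures copied from the track definition above.
DOCUMENT_COUNT = 165346692
COMPRESSED_BYTES = 4812721501
UNCOMPRESSED_BYTES = 79802445255


def expand_track_lists(index_count):
    """Build the "indices" and "documents" lists that the template loops produce.

    Hypothetical helper for illustration only; Rally renders the Jinja2
    template itself and does not call anything like this.
    """
    indices = []
    documents = []
    for item in range(index_count):
        name = f"nyc_taxis-{item}"
        indices.append({
            "name": name,
            "body": "index.json",
            "types": ["type"],
            "auto-managed": False,
        })
        # Every entry points at the same source file but a different index.
        documents.append({
            "target-index": name,
            "target-type": "type",
            "source-file": "documents.json.bz2",
            "document-count": DOCUMENT_COUNT,
            "compressed-bytes": COMPRESSED_BYTES,
            "uncompressed-bytes": UNCOMPRESSED_BYTES,
        })
    return indices, documents


indices, documents = expand_track_lists(10)
total_docs = sum(d["document-count"] for d in documents)
total_uncompressed_gib = sum(d["uncompressed-bytes"] for d in documents) / 2**30

# Show the first generated index entry and the overall volume.
print(json.dumps(indices[0], indent=2))
print(f"{len(indices)} indices, {total_docs} documents, "
      f"{total_uncompressed_gib:.1f} GiB uncompressed")
```

Changing the argument to expand_track_lists mirrors changing index_count in the template: each additional index adds one more full copy of the corpus to the indexed volume.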

Store this as, for example, nyc_taxis.json and run it with, for example, esrally --distribution-version=6.1.1 --on-error=abort --track-path=/path/to/nyc_taxis.json. Note that you also need to store the index definition from https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/index.json in the same directory as nyc_taxis.json for this to work.
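
Because the track only works when index.json sits next to nyc_taxis.json, a small pre-flight check can catch a missing file before Rally starts. The following is a sketch under that assumption: build_esrally_command is a hypothetical helper, not part of Rally, and it only assembles the command line quoted above after verifying that both files exist.

```python
from pathlib import Path


def build_esrally_command(track_dir, distribution_version="6.1.1"):
    """Return the esrally command line for the custom track in track_dir.

    Hypothetical helper for illustration; raises FileNotFoundError if
    nyc_taxis.json or index.json is missing from the directory.
    """
    track_dir = Path(track_dir)
    track_file = track_dir / "nyc_taxis.json"
    index_file = track_dir / "index.json"
    for required in (track_file, index_file):
        if not required.is_file():
            raise FileNotFoundError(f"missing {required.name} in {track_dir}")
    # Flags are the ones used in the command quoted above.
    return [
        "esrally",
        f"--distribution-version={distribution_version}",
        "--on-error=abort",
        f"--track-path={track_file}",
    ]
```

The returned list can be passed to subprocess.run to start the benchmark, for example subprocess.run(build_esrally_command("/path/to/track-dir"), check=True), where the directory path is a placeholder.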

Alternatively, you can use the eventdata track from Christian Dahlqvist, which uses generated data but allows you to create arbitrarily large indices.


#3

Perfect, thanks Daniel.
I will try this trick and keep you posted.

