Increase data size in existing Rally tracks

Hi Daniel,

I am using Rally's existing tracks for benchmarking. I noticed that nyc_taxis is the largest track, with 4.5 GB of compressed and 74.3 GB of uncompressed documents. I want to test with a larger data volume. Does Rally provide an option to duplicate or triplicate the data in existing tracks?

Hi @Alp1,

You can apply a trick so that Rally indexes the data into multiple indices, but you need to create your own track for that. I suggest using the latest version (0.9.1), because Rally 0.9.0 introduced the concept of "document corpora", which lets you reuse document corpora from other tracks. Here is a complete example that bulk-indexes the nyc_taxis document corpus ten times (note the index_count variable at the top):

{% set index_count = 10 %}
{
  "version": 2,
  "description": "Taxi rides in New York in 2015",
  "indices": [
  {% set comma = joiner() %}
  {% for item in range(index_count) %}
  {{ comma() }}
    {
      "name": "nyc_taxis-{{item}}",
      "body": "index.json",
      "types": [ "type" ],
      "auto-managed": false
    }
  {% endfor %}
  ],
  "corpora": [
    {
      "name": "nyc_taxis",
      "base-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/nyc_taxis",
      "documents": [
      {% set comma = joiner() %}
      {% for item in range(index_count) %}
      {{ comma() }}
        {
          "target-index": "nyc_taxis-{{item}}",
          "target-type": "type",
          "source-file": "documents.json.bz2",
          "document-count": 165346692,
          "compressed-bytes": 4812721501,
          "uncompressed-bytes": 79802445255
        }
      {% endfor %}
      ]
    }
  ],
  "challenge": {
      "name": "bulk-index",
      "schedule": [
        {
          "operation": "delete-index"
        },
        {
          "operation": {
            "operation-type": "create-index",
            "settings": {
              "index.number_of_replicas": 0
            }
          }
        },
        {
          "name": "check-cluster-health",
          "operation": {
            "operation-type": "cluster-health",
            "index": "nyc_taxis-*",
            "request-params": {
              "wait_for_status": "{{cluster_health | default('green')}}",
              "wait_for_no_relocating_shards": "true"
            }
          }
        },
        {
          "operation": {
            "name": "index-append",
            "operation-type": "bulk",
            "bulk-size": {{bulk_size | default(10000)}}
          },
          "clients": 8,
          "warmup-time-period": 0
        },
        {
          "operation": "refresh",
          "clients": 1
        },
        {
          "operation": "force-merge",
          "clients": 1
        }
      ]
    }
}
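
To see what the two template loops expand to, and how much data a given index_count implies, here is a small Python sketch. It is only an illustration: Rally renders the template itself, and the function names expand and totals are made up for this example. The per-copy figures are copied from the "documents" entry in the track.

```python
import json

# Per-copy figures copied from the "documents" entry in the track.
DOCUMENT_COUNT = 165_346_692
COMPRESSED_BYTES = 4_812_721_501
UNCOMPRESSED_BYTES = 79_802_445_255


def expand(index_count):
    """Mimic the two template loops: one index and one documents entry per copy.

    Hypothetical helper for illustration; not part of Rally.
    """
    indices = [
        {
            "name": f"nyc_taxis-{i}",
            "body": "index.json",
            "types": ["type"],
            "auto-managed": False,
        }
        for i in range(index_count)
    ]
    documents = [
        {
            "target-index": f"nyc_taxis-{i}",
            "target-type": "type",
            "source-file": "documents.json.bz2",  # same file for every copy
            "document-count": DOCUMENT_COUNT,
            "compressed-bytes": COMPRESSED_BYTES,
            "uncompressed-bytes": UNCOMPRESSED_BYTES,
        }
        for i in range(index_count)
    ]
    return indices, documents


def totals(index_count):
    """Return (total documents, total uncompressed GiB) across all copies."""
    return index_count * DOCUMENT_COUNT, index_count * UNCOMPRESSED_BYTES / 2**30


if __name__ == "__main__":
    indices, documents = expand(3)
    print(json.dumps([ix["name"] for ix in indices]))
    docs, gib = totals(10)
    print(f"{docs} documents, about {gib:.0f} GiB uncompressed")
```

Every documents entry points at the same source file, so raising index_count increases the amount of data indexed into the cluster, not the size of the corpus itself.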

Store the track as e.g. nyc_taxis.json and run it with (for example) esrally --distribution-version=6.1.1 --on-error=abort --track-path=/path/to/nyc_taxis.json. Note that you also need to store the index definition from https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/index.json in the same directory as nyc_taxis.json for this to work.
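
If you want to script this, a minimal pre-flight check could look like the sketch below. It only verifies that index.json sits next to the track file and assembles the command line shown above; build_command is a hypothetical helper, not something Rally provides, and the default version is just the one from the example.

```python
import shlex
import sys
from pathlib import Path


def build_command(track_file, distribution_version="6.1.1"):
    """Check the track layout and return the esrally argument list.

    Hypothetical helper for illustration; the flags are the ones from the
    example command in this post.
    """
    track = Path(track_file)
    index_body = track.parent / "index.json"  # must sit next to the track file
    missing = [str(p) for p in (track, index_body) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing: " + ", ".join(missing))
    return [
        "esrally",
        f"--distribution-version={distribution_version}",
        "--on-error=abort",
        f"--track-path={track}",
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(shlex.quote(arg) for arg in build_command(sys.argv[1])))
```

Printing the command rather than executing it keeps the sketch side-effect free; pass the returned list to subprocess.run if you prefer to launch the benchmark directly.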

Alternatively, you can use the eventdata track from Christian Dahlqvist, which uses generated data and lets you create arbitrarily large indices.

Perfect, thanks Daniel.
I will try this trick and keep you posted.