How to calculate the throughput of parallel indexing?

I have a test scenario; the track configuration file (track.json) is:

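{# index_count defaults to 1; override it via --track-params, e.g. "index_count:20". #}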
{% if index_count is not defined %}
{% set index_count = 1 %}
{% endif %}
{% set index_prefix = "billions-jssz.benchmark.scene1-" %}
{% import "rally.helpers" as rally with context %}
{
  "version": 2,
  "description": "throughput with {{index_count}} index",
  "indices": [
    {% set comma = joiner() %}
    {% for item in range(index_count) %}
    {{ comma() }}
    {
      "name": "{{index_prefix}}{{item}}",
      "body": "settings/mapping.json",
      "types": [ "logs" ]
    }
    {% endfor %}
  ],
  "corpora": [
    {
      "name": "jssz_scene_1",
      "target-type": "logs",
      "documents": [
        {% set comma = joiner() %}
        {% for item in range(index_count) %}
        {{ comma() }}
        {
          "target-index": "{{index_prefix}}{{item}}",
          "source-file": "{{index_source_file | default('documents.json')}}",
          "document-count": {{index_document_count | default(1)}},
          "uncompressed-bytes": {{index_document_uncompressed_bytes | default(1)}}
        }
        {% endfor %}
      ]
    }
  ],
  "challenges": [
    {
      "name": "{{challenge_name | default('just_bulk')}}",
      "default": true,
      "schedule": [
        {
          "operation": {
            "operation-type": "delete-index"
          }
        },
        {
          "operation": {
            "operation-type": "create-index"
          }
        },
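        {# One bulk task per index below; Rally runs all tasks inside "parallel" concurrently. #}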
        {
          "parallel": {
            "tasks": [
              {% set comma = joiner() %}
              {% for item in range(index_count) %}
              {{ comma() }}
              {
                "warmup-time-period": {{ bulk_warmup_time_period | default(0) }},
                "operation": {
                  "name": "{{challenge_name}}-bulk-{{item}}",
                  "operation-type": "bulk",
                  "bulk-size": {{ bulk_size | default(200) }},
                  "indices": ["{{index_prefix}}{{item}}"]
                },
                "clients": {{ bulk_clients | default(5) }}
              }
              {% endfor %}
            ]
          }
        }
      ]
    }
  ]
}
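For reference, a track like this would typically be run with the template parameters passed on the command line, e.g. esrally --track-path=/path/to/track --track-params="index_count:20,bulk_clients:5" (the exact invocation depends on your Rally version).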

There are 20 indices, and every index is written to by its own bulk operation (I want to test throughput with many indices, and they should be indexed in parallel). After the race, the summary report looks like this:

[summary report screenshot: one throughput line per bulk operation]
It seems like every bulk operation gets its own throughput. How can I calculate the total throughput across all of them? Is it a simple addition?

With this setup you will get a number of Rally processes, each indexing into a single index. If this is how you will eventually index data, e.g. through separate Logstash pipelines, this will be a reasonably accurate simulation.

If, however, you are likely to have bulk requests containing data for multiple indices at the same time, I would probably take a somewhat different approach to simulate that scenario better. Is this the case?

The scenario is:
It generates 20 indices, like index-0, index-1, ..., index-19.
Every index has its own bulk request with some data: bulk-0 for index-0, bulk-1 for index-1, ..., bulk-19 for index-19. All the bulk requests are executed at the same time.

In this way, every bulk request gets its own throughput report: throughput-0 for bulk-0, throughput-1 for bulk-1, and so on.
So how can I calculate the real total throughput?

Summing up the averages will give you an approximation, but it may not be very exact: the parallel tasks do not necessarily cover exactly the same time interval (warmup, stragglers), so their per-task averages are not strictly additive. The best way would be to set up an Elasticsearch instance as a metrics store and then use Kibana to analyse the detailed results written there.
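To illustrate the metrics-store approach: once Rally is configured to write metrics to Elasticsearch (the datastore.* settings in the [reporting] section of ~/.rally/rally.ini), every throughput sample ends up as a document in the rally-metrics-* indices. A minimal sketch of an aggregation that sums the samples across all bulk tasks over time, assuming the default metric field names (name, value, sample-type, @timestamp):

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "throughput" } },
        { "term": { "sample-type": "normal" } }
      ]
    }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "1s" },
      "aggs": {
        "total_throughput": { "sum": { "field": "value" } }
      }
    }
  }
}

Plotting total_throughput over time (e.g. in Kibana) gives the combined docs/s of all 20 bulk tasks at each moment, which is more meaningful than adding up the per-task averages from the summary report.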

The questions I asked were, however, aimed at checking whether you are benchmarking the right thing, as there might be other ways to structure the track that would directly give you the results you are looking for.

One way would be to create the indices using the create-index operation. There is then an option to have bulk request headers present in the data file (see includes-action-and-meta-data in the docs). This would allow you to mix documents for different indices, so that mixed bulk requests could be sent using a single bulk operation. That would make the stats a lot easier to interpret and potentially also make the benchmark more realistic.
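As a sketch of that approach: with includes-action-and-meta-data the source file interleaves a bulk header line with each document, so one bulk operation can target all 20 indices at once (the field values below are made up for illustration):

{"index": {"_index": "billions-jssz.benchmark.scene1-0", "_type": "logs"}}
{"message": "event for index 0"}
{"index": {"_index": "billions-jssz.benchmark.scene1-7", "_type": "logs"}}
{"message": "event for index 7"}

The corresponding corpora entry then drops target-index and target-type and sets the flag (documents-mixed.json is a hypothetical pre-shuffled file):

"documents": [
  {
    "source-file": "documents-mixed.json",
    "document-count": {{index_document_count | default(1)}},
    "includes-action-and-meta-data": true
  }
]

With a single bulk task, Rally reports one throughput number, so nothing needs to be summed afterwards.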

The real purpose of this scenario is that I want to know the cluster throughput when multiple indices are written at the same time; that is why I designed it this way.
Do you think this scenario is wrong? I see that the official http_logs demo track also has many indices, but it uses just one bulk operation that writes to them serially.

This is a good idea! :+1:

It depends on how you are going to write to these indices. If you will write bulk requests that each contain events for just one index, your approach is fine. If, on the other hand, you will be sending mixed bulk requests (which your current config does not do), you may not get a realistic benchmark, as in a real scenario you might be more likely to experience bulk rejections.

If the data you write to all indices is uniform, a possibly simpler way to simulate this would be to create a single index with the same total number of shards as all your indices combined. Indexing into one index with X shards should be close to indexing into Y indices that each have X/Y shards.
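For example, if each of the 20 indices would have had one primary shard, the single combined index would get 20. A minimal settings body for such an index (the replica count here is just a placeholder for whatever your scenario uses) might be:

{
  "settings": {
    "index": {
      "number_of_shards": 20,
      "number_of_replicas": 0
    }
  }
}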

I am not sure about this approach, but I will try it, along with the includes-action-and-meta-data option mentioned above.

Thanks a lot!
