How to calculate the throughput of parallel indexing?

I have a test scenario; the track configuration file (track.json) is:

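{# index_count defaults to 1; override it via --track-params, e.g. "index_count:20". #}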
{% if index_count is not defined %}
{% set index_count = 1 %}
{% endif %}
{% set index_prefix = "billions-jssz.benchmark.scene1-" %}
{% import "rally.helpers" as rally with context %}
{
  "version": 2,
  "description": "throughput with {{index_count}} index",
  "indices": [
    {% set comma = joiner() %}
    {% for item in range(index_count) %}
    {{ comma() }}
    {
      "name": "{{index_prefix}}{{item}}",
      "body": "settings/mapping.json",
      "types": [ "logs" ]
    }
    {% endfor %}
  ],
  "corpora": [
    {
      "name": "jssz_scene_1",
      "target-type": "logs",
      "documents": [
        {% set comma = joiner() %}
        {% for item in range(index_count) %}
        {{ comma() }}
        {
          "target-index": "{{index_prefix}}{{item}}",
          "source-file": "{{index_source_file | default('documents.json')}}",
          "document-count": {{index_document_count | default(1)}},
          "uncompressed-bytes": {{index_document_uncompressed_bytes | default(1)}}
        }
        {% endfor %}
      ]
    }
  ],
  "challenges": [
    {
      "name": "{{challenge_name | default('just_bulk')}}",
      "default": true,
      "schedule": [
        {
          "operation": {
            "operation-type": "delete-index"
          }
        },
        {
          "operation": {
            "operation-type": "create-index"
          }
        },
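        {# One bulk task per index below; Rally runs all tasks inside "parallel" concurrently. #}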
        {
          "parallel": {
            "tasks": [
              {% set comma = joiner() %}
              {% for item in range(index_count) %}
              {{ comma() }}
              {
                "warmup-time-period": {{ bulk_warmup_time_period | default(0) }},
                "operation": {
                  "name": "{{challenge_name}}-bulk-{{item}}",
                  "operation-type": "bulk",
                  "bulk-size": {{ bulk_size | default(200) }},
                  "indices": ["{{index_prefix}}{{item}}"]
                },
                "clients": {{ bulk_clients | default(5) }}
              }
              {% endfor %}
            ]
          }
        }
      ]
    }
  ]
}
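For reference, a track like this would typically be run with the template parameters passed on the command line, e.g. esrally --track-path=/path/to/track --track-params="index_count:20,bulk_clients:5" (the exact invocation depends on your Rally version).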

There are 20 indices, and every index is written to by its own bulk operation (I want to test throughput with many indices, and they should be indexed in parallel). After the race, the summary report looks like this:

[summary report screenshot: one throughput line per bulk operation]
It seems like every bulk operation gets its own throughput. How can I calculate the total throughput across all of them? Is it a simple addition?

With this setup you will get a number of Rally processes, each indexing into a single index. If this is how you will eventually index data, e.g. through separate Logstash pipelines, this will be a reasonably accurate simulation.

If, however, you are likely to have bulk requests containing data for multiple indices at the same time, I would probably take a somewhat different approach to simulate that scenario better. Is this the case?

The scenario is:
It generates 20 indices, like index-0, index-1, ..., index-19.
Every index has its own bulk request with some data: bulk-0 for index-0, bulk-1 for index-1, ..., bulk-19 for index-19. All the bulk requests are executed at the same time.

In this way, every bulk request gets its own throughput report: throughput-0 for bulk-0, throughput-1 for bulk-1, and so on.
So how can I calculate the real total throughput?

Summing up the averages will give you an approximation, but it may not be very exact: the parallel tasks do not necessarily cover exactly the same time interval (warmup, stragglers), so their per-task averages are not strictly additive. The best way would be to set up an Elasticsearch instance as a metrics store and then use Kibana to analyse the detailed results written there.
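To illustrate the metrics-store approach: once Rally is configured to write metrics to Elasticsearch (the datastore.* settings in the [reporting] section of ~/.rally/rally.ini), every throughput sample ends up as a document in the rally-metrics-* indices. A minimal sketch of an aggregation that sums the samples across all bulk tasks over time, assuming the default metric field names (name, value, sample-type, @timestamp):

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "throughput" } },
        { "term": { "sample-type": "normal" } }
      ]
    }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "1s" },
      "aggs": {
        "total_throughput": { "sum": { "field": "value" } }
      }
    }
  }
}

Plotting total_throughput over time (e.g. in Kibana) gives the combined docs/s of all 20 bulk tasks at each moment, which is more meaningful than adding up the per-task averages from the summary report.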

The questions I asked were, however, aimed at checking whether you are benchmarking the right thing, as there might be other ways to structure the track that would directly give you the results you are looking for.

One way would be to create the indices using the create-index operation. There is then an option to have bulk request headers present in the data file (see includes-action-and-meta-data in the docs). This would allow you to mix documents for different indices, so that mixed bulk requests could be sent using a single bulk operation. That would make the stats a lot easier to interpret and potentially also make the benchmark more realistic.
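As a sketch of that approach: with includes-action-and-meta-data the source file interleaves a bulk header line with each document, so one bulk operation can target all 20 indices at once (the field values below are made up for illustration):

{"index": {"_index": "billions-jssz.benchmark.scene1-0", "_type": "logs"}}
{"message": "event for index 0"}
{"index": {"_index": "billions-jssz.benchmark.scene1-7", "_type": "logs"}}
{"message": "event for index 7"}

The corresponding corpora entry then drops target-index and target-type and sets the flag (documents-mixed.json is a hypothetical pre-shuffled file):

"documents": [
  {
    "source-file": "documents-mixed.json",
    "document-count": {{index_document_count | default(1)}},
    "includes-action-and-meta-data": true
  }
]

With a single bulk task, Rally reports one throughput number, so nothing needs to be summed afterwards.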

The real purpose of this scenario is that I want to know the cluster throughput when multiple indices are written at the same time; that is why I designed it this way.
Do you think this scenario is wrong? I see that the official http_logs demo track also has many indices, but it uses just one bulk operation that writes to them serially.

This is a good idea! :+1:

It depends on how you are going to write to these indices. If you will write bulk requests that each contain events for just one index, your approach is fine. If, on the other hand, you will be sending mixed bulk requests (which your current config does not do), you may not get a realistic benchmark, as in a real scenario you might be more likely to experience bulk rejections.

If the data you write to all indices is uniform, a possibly simpler way to simulate this would be to create a single index with the same total number of shards as all your indices combined. Indexing into one index with X shards should be close to indexing into Y indices that each have X/Y shards.
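For example, if each of the 20 indices would have had one primary shard, the single combined index would get 20. A minimal settings body for such an index (the replica count here is just a placeholder for whatever your scenario uses) might be:

{
  "settings": {
    "index": {
      "number_of_shards": 20,
      "number_of_replicas": 0
    }
  }
}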

I am not sure about this approach, but I will try it, along with the includes-action-and-meta-data option mentioned above.

Thanks a lot!
