Split source-file into many indices

vmasarik · April 10, 2019, 8:46am

Hi,
I have a case where I would like to split one huge source-file document across many (hundreds of) indices. Could Rally help me? I might know why Rally would not support this, but I would be happy to be wrong.

My dream solution would be the following: somehow tell Rally to create 300 indices (with Rally generating the names) and then split document.json across all of the created indices.

Does Rally support this option, or do I have to manually create 300 indices, split the document into 300 parts, and then manually assign each document in the corpora to its target index?

Thanks

dliappis · April 10, 2019, 6:29pm

Hello @vmasarik,

Rally can't do this automatically for you; however, there are ways you can achieve it.

The easiest approach is to create a customized document.json file in which each document is preceded by its own action and metadata line. When you reference the JSON file in your corpora section, you need to tell Rally explicitly that it includes action and metadata lines by setting "includes-action-and-meta-data": true.

I used the example track from the official docs and modified the toJSON.py script as shown in this gist; you can see in the script the variables INDEX_FIRST=0 and INDEX_LAST=299 that are used later to create the necessary action-and-metadata lines.

Running it creates a documents.json file like:

$ head -5 documents.json
{"index": {"_index": "geonames-054"}}
{"geonameid": 2986043, "name": "Pic de Font Blanca", "latitude": 42.64991, "longitude": 1.53335, "country_code": "AD", "population": 0}
{"index": {"_index": "geonames-068"}}
{"geonameid": 2994701, "name": "Roc Mélé", "latitude": 42.58765, "longitude": 1.74028, "country_code": "AD", "population": 0}
{"index": {"_index": "geonames-037"}}
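
For readers without access to the gist, here is a minimal sketch of how such action-and-metadata lines could be generated. It is not the modified toJSON.py itself: the helper names `index_name` and `write_bulk_lines` are hypothetical, and picking the target index at random is an assumption based on the sample output above. Only INDEX_FIRST and INDEX_LAST come from the post.

```python
import io
import json
import random

INDEX_FIRST = 0
INDEX_LAST = 299


def index_name(number, prefix="geonames"):
    # Zero-pad to three digits: geonames-000 ... geonames-299
    return f"{prefix}-{number:03d}"


def write_bulk_lines(docs, out, rng):
    """Write one action-and-metadata line followed by one source line per document."""
    for doc in docs:
        # Assumption: each document goes to a randomly chosen index.
        target = index_name(rng.randint(INDEX_FIRST, INDEX_LAST))
        out.write(json.dumps({"index": {"_index": target}}) + "\n")
        out.write(json.dumps(doc, ensure_ascii=False) + "\n")


# Two sample documents taken from the output shown above (fields shortened).
sample_docs = [
    {"geonameid": 2986043, "name": "Pic de Font Blanca", "country_code": "AD"},
    {"geonameid": 2994701, "name": "Roc Mélé", "country_code": "AD"},
]

buffer = io.StringIO()
write_bulk_lines(sample_docs, buffer, random.Random(42))
print(buffer.getvalue(), end="")
```

In a real run you would write to a file instead of an in-memory buffer and feed the function the full list of parsed documents.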

Then I modified the track.json example in the docs to use a jinja2 loop to create 300 indices like geonames-000 ... geonames-299; the modified code is in this gist.
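
As a rough illustration of what such a loop expands to (the gist itself is not reproduced here), the following Python sketch builds the equivalent "indices" list. The helper `build_indices` is hypothetical, and the "name" and "body" keys mirror the track.json quoted later in this thread; the "types" entry seen there is left out, so add it if your Elasticsearch version needs it.

```python
import json

INDEX_FIRST = 0
INDEX_LAST = 299


def build_indices(first, last, prefix="geonames", body="index.json"):
    """Return one index definition per number in [first, last], inclusive."""
    return [
        {"name": f"{prefix}-{number:03d}", "body": body}
        for number in range(first, last + 1)
    ]


indices = build_indices(INDEX_FIRST, INDEX_LAST)

# Show the first two entries and the total count.
print(json.dumps(indices[:2], indent=2))
print(len(indices))
```

The jinja2 loop in track.json produces the same kind of list at track load time, so no separate preprocessing step like this is needed when using Rally.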

Then I simply ran the commands listed in this gist, i.e. I created the documents.json with action and metadata lines and then executed Rally using:

esrally --distribution-version=7.0.0 --track-path=$PWD

which ended up creating the specified 300 indices and splitting the documents.json across them.

Another approach would be to use a custom parameter source. This is more complicated; you can see an example in the bulk custom parameter source of the eventdata-track.

Regards,
Dimitris

vmasarik · April 11, 2019, 3:16pm

@dliappis

Well, I never expected a response in such detail. Huge thanks, Dimitris!

One more thing before I try this out. Is Rally able to evaluate the jinja2 on its own? Or do I have to preprocess it myself before using it?

Thank you!

dliappis · April 11, 2019, 3:56pm

Yes, Rally can render jinja2 by itself; in fact, we use this feature in our official tracks too (to organize things better), e.g. in https://github.com/elastic/rally-tracks/blob/master/geonames/track.json.

vmasarik · April 12, 2019, 4:37pm

@dliappis

I tried to adapt your example to my situation, but it did not work. After completing successfully, Rally reports:
error rate | bulk | 100 | % |
which results in 300 empty indices.

After trying many different configurations and combinations of them, I copy-pasted your example, and that still resulted in a 100% error rate. This makes me think it is a versioning issue, but I don't know how to deal with it, mainly because I am not sure how to debug this. Any tips would be appreciated.

I did not mention my environment earlier because I did not think it would be that crucial. I am testing a remote cluster running Elasticsearch 5.6.13. I do not know what else might be important, though.

Command:

esrally --track-path=$HOME --target-hosts=elasticsearch --pipeline=benchmark-only

Data that I use:

$ head docs.json
{"index": {"_index": "nasatre", "_type": "docs", "_id": "1"}}
{"ip": "199.72.81.55", "date": "[01/Jul/1995:00:00:01 -0400]", "request": "GET /history/apollo/ HTTP/1.0", "result": 200, "size": 6245}
{"index": {"_index": "nasaone", "_type": "docs", "_id": "2"}}
{"ip": "unicomp6.unicomp.net", "date": "[01/Jul/1995:00:00:06 -0400]", "request": "GET /shuttle/countdown/ HTTP/1.0", "result": 200, "size": 3985}
{"index": {"_index": "nasatre", "_type": "docs", "_id": "3"}}
{"ip": "199.120.110.21", "date": "[01/Jul/1995:00:00:09 -0400]", "request": "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0", "result": 200, "size": 4085}

The track.json:

{
  "version": 2,
  "description": "Desc of a track.",
  "indices": [
      {
        "name": "nasaone",
        "body": "index.json",
        "types": [ "docs" ]
      },
      {
        "name": "nasatwo",
        "body": "index.json",
        "types": [ "docs" ]
      },
      {
        "name": "nasatre",
        "body": "index.json",
        "types": [ "docs" ]
      }
  ],
  "corpora": [
    {
      "name": "nasa",
      "documents": [
        {
          "source-file": "docs.json",
          "includes-action-and-meta-data": true,
          "document-count": 1050
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        }
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5000
      }
    }
  ]
}
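
The thread never states what the actual error was, so the following is only a generic sanity check, not the fix. It is a sketch of how one might confirm that every "_index" named in the action and metadata lines is declared in the track, and that the number of documents matches "document-count". The function `check_bulk_file` is hypothetical, and the sample data is abbreviated from the post above.

```python
import json


def check_bulk_file(lines, declared_indices):
    """Return (document_count, undeclared_index_names) for bulk-formatted lines."""
    doc_count = 0
    undeclared = set()
    # Lines alternate: action-and-metadata line first, then the document source.
    for line_number, line in enumerate(lines):
        if line_number % 2 == 0:
            action = json.loads(line)
            target = action["index"]["_index"]
            if target not in declared_indices:
                undeclared.add(target)
        else:
            doc_count += 1
    return doc_count, undeclared


# Index names as declared in the track.json above.
track = {"indices": [{"name": "nasaone"}, {"name": "nasatwo"}, {"name": "nasatre"}]}

# First two documents from docs.json, with the source lines shortened.
sample_lines = [
    '{"index": {"_index": "nasatre", "_type": "docs", "_id": "1"}}',
    '{"ip": "199.72.81.55", "result": 200, "size": 6245}',
    '{"index": {"_index": "nasaone", "_type": "docs", "_id": "2"}}',
    '{"ip": "unicomp6.unicomp.net", "result": 200, "size": 3985}',
]

declared = {index["name"] for index in track["indices"]}
count, undeclared = check_bulk_file(sample_lines, declared)
print(count, sorted(undeclared))
```

For the full file, the returned count can be compared against the "document-count" value in the corpora section (1050 in the track above).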

danielmitterdorfer · April 14, 2019, 7:03pm

Hi,

You can add the command line parameter --on-error=abort when starting your benchmark. Rally will then abort on the first erroneous request with an error message describing what went wrong. This should hopefully help you to diagnose the problem.

Daniel

vmasarik · April 18, 2019, 12:34pm

@danielmitterdorfer
Thanks a lot, I was able to solve the problem using that parameter.

system · May 16, 2019, 12:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
