Split source-file into many indices

Hi,
I have a case where I would like to split one huge source-file document into many (hundreds) indices. Could Rally help me? I might know why Rally would not support this, but I would be happy to be wrong.

My dream solution would be the following: somehow tell Rally to create 300 indices (Rally would generate the names), then split document.json across all created indices.

Does Rally support this option, or do I have to create 300 indices manually, split the document into 300 parts, and then assign each part, in the corpora section, its target index?

Thanks

Hello @vmasarik,

Rally can't do this automatically for you; however, there are ways you can achieve it.

The easiest approach is to create a customized document.json file in which each document is preceded by its own action-and-metadata line. When you reference the json file in your corpora section, you need to tell Rally explicitly that the file includes action and metadata lines by setting "includes-action-and-meta-data": true.

I used the example track from the official docs and modified the toJSON.py script as shown in this gist; you can see in the script the variables INDEX_FIRST=0 and INDEX_LAST=299 that are used later to create the necessary action-and-metadata lines.

Running it creates a documents.json file like:

$ head -5 documents.json
{"index": {"_index": "geonames-054"}}
{"geonameid": 2986043, "name": "Pic de Font Blanca", "latitude": 42.64991, "longitude": 1.53335, "country_code": "AD", "population": 0}
{"index": {"_index": "geonames-068"}}
{"geonameid": 2994701, "name": "Roc Mélé", "latitude": 42.58765, "longitude": 1.74028, "country_code": "AD", "population": 0}
{"index": {"_index": "geonames-037"}}
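
The gist itself is not reproduced in this thread. As a minimal sketch of the idea only, the following shows one way such action-and-metadata lines could be generated. The helper names action_line and with_action_lines are hypothetical, and picking the target index at random is an assumption based on the unordered index numbers in the output above; it is not necessarily what the modified toJSON.py does.

```python
import json
import random

# Index range mentioned in the post: geonames-000 ... geonames-299.
INDEX_FIRST = 0
INDEX_LAST = 299


def action_line(index_number, prefix="geonames"):
    """Build one bulk action-and-metadata line (hypothetical helper)."""
    return json.dumps({"index": {"_index": f"{prefix}-{index_number:03d}"}})


def with_action_lines(docs, rng):
    """Yield an action line followed by the document line for each doc.

    Assumption: the target index is chosen at random per document.
    """
    for doc in docs:
        yield action_line(rng.randint(INDEX_FIRST, INDEX_LAST))
        yield json.dumps(doc, ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        {"geonameid": 2986043, "name": "Pic de Font Blanca", "country_code": "AD"},
        {"geonameid": 2994701, "name": "Roc Mélé", "country_code": "AD"},
    ]
    for line in with_action_lines(sample, random.Random()):
        print(line)
```

Writing the yielded lines to documents.json, one per line, gives a file in the shape shown above, with two lines per document.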

Then I modified the track.json example in the docs to use a jinja2 loop to create 300 indices like geonames-000 ... geonames-299; the modified code is in this gist.
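
The modified track.json is only available in the gist. To make the target concrete, here is a rough sketch, in plain Python rather than jinja2, of the list of index definitions that such a loop would need to expand to. The helper index_entries is hypothetical, and the "body": "index.json" value is an assumption borrowed from the track.json posted later in this thread.

```python
import json

INDEX_FIRST = 0
INDEX_LAST = 299


def index_entries(first=INDEX_FIRST, last=INDEX_LAST,
                  prefix="geonames", body="index.json"):
    """Return one index definition per number in [first, last].

    Assumption: every index shares the same settings/mappings file.
    """
    return [
        {"name": f"{prefix}-{n:03d}", "body": body}
        for n in range(first, last + 1)
    ]


if __name__ == "__main__":
    entries = index_entries()
    print(len(entries), "indices")
    # Show the first two entries as they would appear under "indices".
    print(json.dumps(entries[:2], indent=2))
```

The zero-padded names have to match the _index values in the action-and-metadata lines exactly, which is why both sketches use the same three-digit format.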

Then I simply ran the commands listed in this gist, that is, I created the documents.json with action and metadata lines and then ran Rally using:

esrally --distribution-version=7.0.0 --track-path=$PWD

which ended up creating the specified 300 indices and splitting the documents in documents.json across them.

Another approach would be using a custom parameter source. This is more complicated and you can see an example in the bulk custom parameter source of the eventdata-track.

Regards,
Dimitris

@dliappis

Well, I never expected a response in such detail. Huge thanks, Dimitris!

One more thing before I try this out. Is Rally able to evaluate the jinja2 on its own? Or do I have to preprocess it myself before using it?

Thank you!

Yes, Rally can render jinja2 by itself; in fact, we use this feature in our official tracks too (to organize things better), e.g. in https://github.com/elastic/rally-tracks/blob/master/geonames/track.json.

@dliappis

I tried to adapt your example to my situation, but it did not work. After completing without complaint, Rally reports:
error rate | bulk | 100 | % |
and the result is 300 empty indices.

After trying many different configurations and combinations of them, I copied and pasted your example as-is, and that still resulted in a 100% error rate. That makes me think it is a versioning issue, but I don't know how to deal with it, mainly because I am not sure how to debug this. Any tips would be appreciated :slight_smile:

I did not mention my environment because I never thought it would be that crucial. I am testing a remote cluster running Elasticsearch 5.6.13. I do not know what else might be important, though.

Command:

esrally --track-path=$HOME --target-hosts=elasticsearch --pipeline=benchmark-only

Data that I use:

$ head docs.json
{"index": {"_index": "nasatre", "_type": "docs", "_id": "1"}}
{"ip": "199.72.81.55", "date": "[01/Jul/1995:00:00:01 -0400]", "request": "GET /history/apollo/ HTTP/1.0", "result": 200, "size": 6245}
{"index": {"_index": "nasaone", "_type": "docs", "_id": "2"}}
{"ip": "unicomp6.unicomp.net", "date": "[01/Jul/1995:00:00:06 -0400]", "request": "GET /shuttle/countdown/ HTTP/1.0", "result": 200, "size": 3985}
{"index": {"_index": "nasatre", "_type": "docs", "_id": "3"}}
{"ip": "199.120.110.21", "date": "[01/Jul/1995:00:00:09 -0400]", "request": "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0", "result": 200, "size": 4085}

The track.json:

{
  "version": 2,
  "description": "Desc of a track.",
  "indices": [
      {
        "name": "nasaone",
        "body": "index.json",
        "types": [ "docs" ]
      },
      {
        "name": "nasatwo",
        "body": "index.json",
        "types": [ "docs" ]
      },
      {
        "name": "nasatre",
        "body": "index.json",
        "types": [ "docs" ]
      }
  ],
  "corpora": [
    {
      "name": "nasa",
      "documents": [
        {
          "source-file": "docs.json",
          "includes-action-and-meta-data": true,
          "document-count": 1050
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        }
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5000
      }
    }
  ]
}

Hi,

You can add the command line parameter --on-error=abort when starting your benchmark. Rally will then abort on the first erroneous request with an error message describing what went wrong. This should hopefully help you diagnose the problem.
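
Separately from that flag, a quick offline look at the data file can rule out simple mismatches before a run. The following is only a sketch: check_bulk_file is a hypothetical helper, not part of Rally, and it assumes that the file strictly alternates action and document lines and that document-count in track.json counts documents rather than lines. It does not claim to reproduce whatever error Rally reports.

```python
import json


def check_bulk_file(lines, declared_indices):
    """Return (document_count, undeclared_targets) for bulk-style lines.

    lines: iterable of strings, alternating action line / document line.
    declared_indices: index names listed under "indices" in track.json.
    """
    lines = [line for line in lines if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected pairs of action and document lines")
    targets = set()
    for action in lines[0::2]:
        # An action line looks like {"index": {"_index": "...", ...}}.
        meta = next(iter(json.loads(action).values()))
        targets.add(meta["_index"])
    undeclared = sorted(targets - set(declared_indices))
    return len(lines) // 2, undeclared


if __name__ == "__main__":
    sample = [
        '{"index": {"_index": "nasatre", "_type": "docs", "_id": "1"}}',
        '{"ip": "199.72.81.55", "result": 200, "size": 6245}',
        '{"index": {"_index": "nasaone", "_type": "docs", "_id": "2"}}',
        '{"ip": "unicomp6.unicomp.net", "result": 200, "size": 3985}',
    ]
    count, undeclared = check_bulk_file(sample, ["nasaone", "nasatwo", "nasatre"])
    print("documents:", count)
    print("targets missing from track.json:", undeclared)
```

The returned count can be compared by hand with the document-count value, and any name in the second result points at an index that the track never creates.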

Daniel

@danielmitterdorfer
Thanks a lot, I was able to solve the problem using that parameter.
