Set custom document ids on bulk insert

dmabuada · December 12, 2020, 6:21pm

Sorry, let me try to elaborate further:

I created a custom track from data in an existing cluster:
esrally create-track --track=test-track --target-hosts=127.0.0.1:9200 --indices="articles" --output-path=~/tracks

The track generator generated a json file with the documents from my cluster:

// documents.json file
{"contentType":"page", "contentId": 1, "title":"hello world"}
{"contentType":"post", "contentId": 2, "title":"hello world"}
{"contentType":"page", "contentId": 3, "title":"hello world"}
{"contentType":"page", "contentId": 4, "title":"hello world"}
{"contentType":"post", "contentId": 5, "title":"hello world"}
{"contentType":"post", "contentId": 6, "title":"hello world"}

Note that the generator writes out only each document's source and does not preserve any self-defined IDs the documents had in the original cluster.

In my track.json, I reference my source file within the corpora and the operations I want to run:

"corpora": [
    {
      "name": "test-documents",
      "documents": [
        {
          "source-file": "documents.json",
          "document-count": 6,
          "uncompressed-bytes": 123
        }
      ]
    }
  ]
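For reference, here is a minimal sketch of how the "document-count" and "uncompressed-bytes" values could be derived for a source-only file like the one above. It assumes one JSON document per non-empty line and that "uncompressed-bytes" is meant to be the byte size of the uncompressed file; the helper name corpus_stats and the inline sample are made up for illustration.

```python
import json


def corpus_stats(raw: bytes) -> dict:
    """Count documents and bytes in a newline-delimited JSON corpus.

    Assumes a source-only file: every non-empty line is one document.
    """
    lines = [line for line in raw.splitlines() if line.strip()]
    for line in lines:
        # Raises ValueError if a line is not valid JSON.
        json.loads(line)
    return {"document-count": len(lines), "uncompressed-bytes": len(raw)}


# Illustrative sample: the first two lines of documents.json shown earlier.
sample = (
    b'{"contentType":"page", "contentId": 1, "title":"hello world"}\n'
    b'{"contentType":"post", "contentId": 2, "title":"hello world"}\n'
)
stats = corpus_stats(sample)
print(stats)
```

For a real file, the bytes would come from reading documents.json in binary mode, for example open("documents.json", "rb").read().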

"schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        }
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5
      }
    }
  ]

The bulk operation indexes the documents with auto-generated IDs (expected), but I want to use self-defined IDs. I went ahead and modified the documents.json file so that each document is preceded by an action line carrying its _id, following the structure of the bulk API:

// modified documents.json file
{ "index" : { "_index" : "articles", "_id" : "1" } }
{"contentType":"page", "contentId": 1, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "2" } }
{"contentType":"post", "contentId": 2, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "3" } }
{"contentType":"page", "contentId": 3, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "4" } }
{"contentType":"page", "contentId": 4, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "5" } }
{"contentType":"post", "contentId": 5, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "6" } }
{"contentType":"post", "contentId": 6, "title":"hello world"}
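Rather than editing the file by hand, the modified layout above could be produced with a small script. This is only a sketch: it assumes the ID should be taken from each document's contentId field and that the target index is "articles", as in the example; the function name to_bulk_lines is hypothetical.

```python
import json


def to_bulk_lines(source_lines, index_name="articles", id_field="contentId"):
    """Interleave a bulk-API action line before each source document line.

    The _id is taken from `id_field` (an assumption for this example) and
    written as a string, matching the hand-edited file above.
    """
    out = []
    for line in source_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        action = {"index": {"_index": index_name, "_id": str(doc[id_field])}}
        out.append(json.dumps(action))
        out.append(line)  # keep the original document line unchanged
    return out


# Illustrative input: one line from the original documents.json.
source = ['{"contentType":"page", "contentId": 1, "title":"hello world"}']
bulk_lines = to_bulk_lines(source)
print("\n".join(bulk_lines))
```

Reading the original documents.json line by line and writing the returned lines to a new file gives the same action/document pairs as shown above.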

The documents are successfully inserted when I re-run the track, but the IDs are still auto-generated. The self-defined IDs seem to be ignored.
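One way to confirm this is to compare each hit's _id against its contentId in the response of a search on the articles index. The sketch below works on an already-parsed search response (the usual hits.hits list with _id and _source); the function name mismatched_ids and the sample response, including its ID value, are made up for illustration and are not real output from my cluster.

```python
def mismatched_ids(search_response, id_field="contentId"):
    """Return (actual _id, expected _id) pairs for hits whose IDs differ."""
    mismatched = []
    for hit in search_response["hits"]["hits"]:
        expected = str(hit["_source"][id_field])
        if hit["_id"] != expected:
            mismatched.append((hit["_id"], expected))
    return mismatched


# Hypothetical response: the second hit stands in for an auto-generated ID.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"contentId": 1}},
            {"_id": "hypothetical-auto-id", "_source": {"contentId": 2}},
        ]
    }
}
print(mismatched_ids(sample_response))
```

An empty list would mean every document kept its self-defined ID.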
