Set custom document ids on bulk insert

dmabuada · December 11, 2020, 10:02pm

I'm attempting to bulk insert generated data from the track generator (I created my own custom track), but I'd like to disable auto-generated IDs on insert. I modified my data files to include the _id prop on each document but esrally seems to ignore it.

I'm creating a benchmark task that simulates the execution of a percolate query that's passed a known ID:

"percolate": {
  "field": "query",
  "index": "my-index",
  "id": "2" // custom document ID
}

Please advise on how I may be able to achieve that.

aaron-nimocks · December 12, 2020, 12:02am

Is this what you are trying to do?

POST _bulk
{"index":{"_index":"test","_id":"1"}}
{"field1":"value1"}
{"delete":{"_index":"test","_id":"2"}}
{"create":{"_index":"test","_id":"3"}}
{"field1":"value3"}
{"update":{"_id":"1","_index":"test"}}
{"doc":{"field2":"value2"}}

dmabuada · December 12, 2020, 6:21pm

Sorry, let me try to elaborate further:

I created a custom track from data in an existing cluster:
esrally create-track --track=test-track --target-hosts=127.0.0.1:9200 --indices="articles" --output-path=~/tracks

The track generator generated a json file with the documents from my cluster:

// documents.json file
{"contentType":"page", "contentId": 1, "title":"hello world"}
{"contentType":"post", "contentId": 2, "title":"hello world"}
{"contentType":"page", "contentId": 3, "title":"hello world"}
{"contentType":"page", "contentId": 4, "title":"hello world"}
{"contentType":"post", "contentId": 5, "title":"hello world"}
{"contentType":"post", "contentId": 6, "title":"hello world"}

Note here that the generator only outputs the source data and doesn't preserve any of the self-defined IDs the documents might have.

In my track.json, I reference my source file within the corpora and the operations I want to run:

"corpora": [
    {
      "name": "test-documents",
      "documents": [
        {
          "source-file": "documents.json",
          "document-count": 6,
          "uncompressed-bytes": 123
        }
      ]
    }
  ]

"schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        }
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5
      }
    }
  ]
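As an aside, the "document-count" and "uncompressed-bytes" values in the corpora entry above can be computed rather than typed by hand. This is only a minimal sketch, assuming a source-only file with one JSON document per line; the helper name corpus_stats is made up for illustration and is not part of Rally.

```python
import os


def corpus_stats(path):
    """Return (document_count, uncompressed_bytes) for a source-only file.

    Assumes one JSON document per non-empty line (no action lines).
    corpus_stats is a hypothetical helper, not a Rally function.
    """
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.strip():  # ignore blank lines, e.g. a trailing newline
                count += 1
    return count, os.path.getsize(path)
```

Calling corpus_stats("documents.json") returns a tuple whose two values can be copied into the corpora entry.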

The bulk operation indexes the documents with auto-generated IDs (expected), but I want to use self-defined IDs. I went ahead and modified the documents.json file to include the _id on each document. It follows the structure of the bulk API:

// modified documents.json file
{ "index" : { "_index" : "articles", "_id" : "1" } }
{"contentType":"page", "contentId": 1, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "2" } }
{"contentType":"post", "contentId": 2, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "3" } }
{"contentType":"page", "contentId": 3, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "4" } }
{"contentType":"page", "contentId": 4, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "5" } }
{"contentType":"post", "contentId": 5, "title":"hello world"}
{ "index" : { "_index" : "articles", "_id" : "6" } }
{"contentType":"post", "contentId": 6, "title":"hello world"}

The documents are successfully inserted when I re-run the track, but the IDs are still autogenerated. The self-defined IDs seem to be ignored.
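For anyone following along, the interleaving shown in the modified file can be scripted instead of edited by hand. This is a rough sketch under two assumptions that come from my example data rather than from Rally: the desired _id is the value of the contentId field, and the target index is "articles". The function names are made up for illustration.

```python
import json


def add_action_lines(src_lines, index_name, id_field):
    """Yield an action/metadata line followed by the source line for each document.

    Assumes every non-empty input line is one JSON document that contains
    id_field. Hypothetical helper, not part of Rally.
    """
    for raw in src_lines:
        raw = raw.strip()
        if not raw:
            continue
        doc = json.loads(raw)
        action = {"index": {"_index": index_name, "_id": str(doc[id_field])}}
        yield json.dumps(action)
        yield raw  # keep the original source line unchanged


def convert_file(src_path, dst_path, index_name="articles", id_field="contentId"):
    """Write a bulk-formatted copy of src_path to dst_path (both are assumptions)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in add_action_lines(src, index_name, id_field):
            dst.write(line + "\n")  # every line, including the last, ends with a newline
```

Writing to a separate output file keeps the generated corpus intact in case the conversion needs to be repeated.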

RickBoyd · December 14, 2020, 4:26pm

Hi @dmabuada, welcome!
Rally should pick up your inserted metadata if you add "includes-action-and-meta-data": "true" (doc) to your documents entry in your track's corpora stanza. Please let us know if that works for you.

Also, FYI I have opened a Rally issue for an enhancement to grab these metadata for you in the generated corpora: https://github.com/elastic/rally/issues/1134

dmabuada · December 15, 2020, 10:10pm

Thanks so much for the reply, @RickBoyd.

I went ahead and added "includes-action-and-meta-data": "true" to my config but I keep getting the following error:

[ERROR] Cannot race. Error in load generator [2]
	("Request returned an error. Error type: transport, Description: illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]'}], 'type': 'illegal_argument_exception', 'reason': 'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]'}, 'status': 400})", None)

I confirmed that the actions in my json file are defined correctly because issuing the following POST request via curl successfully inserts the documents:

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/index-test/_bulk?pretty' --data-binary @documents-1k.json

// documents-1k.json file with appropriate newlines
{ "index" : { "_index" : "index-test", "_id" : "1" } }
{"contentType":"page", "contentId": 1, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "2" } }
{"contentType":"post", "contentId": 2, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "3" } }
{"contentType":"page", "contentId": 3, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "4" } }
{"contentType":"page", "contentId": 4, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "5" } }
{"contentType":"post", "contentId": 5, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "6" } }
{"contentType":"post", "contentId": 6, "title":"hello world"}
// final line of data has a newline also

Might there be something else that I'm missing?
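One way to narrow this kind of error down is to check the exact file that gets read for bulk framing problems. The following is a minimal sketch, assuming the standard bulk layout shown earlier in the thread (an action line, then a source line, except for delete); it is not how Elasticsearch or Rally validate input, and the function name is made up.

```python
import json

# Action names accepted by the bulk API.
BULK_ACTIONS = {"index", "create", "update", "delete"}


def first_malformed_line(lines):
    """Return (line_number, reason) for the first framing problem, or None."""
    expect_action = True
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines such as a trailing newline
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError as exc:
            return number, f"invalid JSON: {exc.msg}"
        if not isinstance(parsed, dict):
            return number, "line is not a JSON object"
        if expect_action:
            if len(parsed) != 1:
                return number, "expected exactly one action key"
            action, meta = next(iter(parsed.items()))
            if action not in BULK_ACTIONS:
                return number, f"unknown action {action!r}"
            if not isinstance(meta, dict):
                return number, "action metadata must be an object"
            # A delete action has no source line; every other action does.
            expect_action = action == "delete"
        else:
            expect_action = True  # this was a source line; an action comes next
    return None
```

Passing an open file object works, since iterating over it yields lines. A result of None only means the framing looks consistent; it does not guarantee that the bulk request will succeed.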

I appreciate all the help and I look forward to being able to grab the metadata from the generated corpora.

RickBoyd · December 17, 2020, 2:59pm

Hi! Unfortunately, I was not able to reproduce this with the information provided. Below are the artifacts from my attempt, which will hopefully help you figure out what differs in your setup. Please let us know what you find.

ESRally Output

(.venv) rick.boyd@Ricks-MBP discuss-202012 % esrally --track-path=.

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Preparing for race ...
[INFO] Preparing file offset table for [/Users/rick.boyd/scratch/discuss-202012/docs.json] ... [OK]
[INFO] Racing on track [discuss-202012] and car ['defaults'] with version [8.0.0-SNAPSHOT].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running bulk                                                                   [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |           Task |       Value |   Unit |
|---------------------------------------------------------------:|---------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                |     0.00015 |    min |
|             Min cumulative indexing time across primary shards |                |           0 |    min |
|          Median cumulative indexing time across primary shards |                |     7.5e-05 |    min |
|             Max cumulative indexing time across primary shards |                |     0.00015 |    min |
|            Cumulative indexing throttle time of primary shards |                |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                |           0 |    min |
|                        Cumulative merge time of primary shards |                |           0 |    min |
|                       Cumulative merge count of primary shards |                |           0 |        |
|                Min cumulative merge time across primary shards |                |           0 |    min |
|             Median cumulative merge time across primary shards |                |           0 |    min |
|                Max cumulative merge time across primary shards |                |           0 |    min |
|               Cumulative merge throttle time of primary shards |                |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                |           0 |    min |
|                      Cumulative refresh time of primary shards |                |     0.00055 |    min |
|                     Cumulative refresh count of primary shards |                |           5 |        |
|              Min cumulative refresh time across primary shards |                |           0 |    min |
|           Median cumulative refresh time across primary shards |                |    0.000275 |    min |
|              Max cumulative refresh time across primary shards |                |     0.00055 |    min |
|                        Cumulative flush time of primary shards |                |           0 |    min |
|                       Cumulative flush count of primary shards |                |           0 |        |
|                Min cumulative flush time across primary shards |                |           0 |    min |
|             Median cumulative flush time across primary shards |                |           0 |    min |
|                Max cumulative flush time across primary shards |                |           0 |    min |
|                                        Total Young Gen GC time |                |           0 |      s |
|                                       Total Young Gen GC count |                |           0 |        |
|                                          Total Old Gen GC time |                |           0 |      s |
|                                         Total Old Gen GC count |                |           0 |        |
|                                                     Store size |                | 5.02542e-06 |     GB |
|                                                  Translog size |                | 7.00355e-07 |     GB |
|                                         Heap used for segments |                |  0.00178909 |     MB |
|                                       Heap used for doc values |                | 7.24792e-05 |     MB |
|                                            Heap used for terms |                |  0.00112915 |     MB |
|                                            Heap used for norms |                |  0.00012207 |     MB |
|                                           Heap used for points |                |           0 |     MB |
|                                    Heap used for stored fields |                | 0.000465393 |     MB |
|                                                  Segment count |                |           1 |        |
|                                       100th percentile latency | cluster-health |     30009.2 |     ms |
|                                  100th percentile service time | cluster-health |     30009.2 |     ms |
|                                                     error rate | cluster-health |         100 |      % |
|                                                 Min Throughput |           bulk |        6.44 | docs/s |
|                                              Median Throughput |           bulk |        6.44 | docs/s |
|                                                 Max Throughput |           bulk |        6.44 | docs/s |
|                                        50th percentile latency |           bulk |     455.218 |     ms |
|                                       100th percentile latency |           bulk |     822.333 |     ms |
|                                   50th percentile service time |           bulk |     455.218 |     ms |
|                                  100th percentile service time |           bulk |     822.333 |     ms |
|                                                     error rate |           bulk |           0 |      % |

[WARNING] Error rate is 100.0 for operation 'cluster-health'. Please check the logs.
[WARNING] No throughput metrics available for [cluster-health]. Likely cause: Error rate is 100.0%. Please check the logs.

--------------------------------
[INFO] SUCCESS (took 73 seconds)

docs.json

{ "index" : { "_index" : "index-test", "_id" : "1" } }
{"contentType":"page", "contentId": 1, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "2" } }
{"contentType":"post", "contentId": 2, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "3" } }
{"contentType":"page", "contentId": 3, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "4" } }
{"contentType":"page", "contentId": 4, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "5" } }
{"contentType":"post", "contentId": 5, "title":"hello world"}
{ "index" : { "_index" : "index-test", "_id" : "6" } }
{"contentType":"post", "contentId": 6, "title":"hello world"}

index.json

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

track.json

{
"indices": [
    {
        "name":"discuss",
        "body":"index.json"
    }
],
"corpora": [
    {
      "name": "test-documents",
      "documents": [
        {
          "source-file": "docs.json",
          "document-count": 6,
          "includes-action-and-meta-data": true
        }
      ]
    }
  ],
"schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        }
      }
    },
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5
      }
    }
  ]
}

Note that I'm ignoring the cluster-health failure, as it's not relevant to your error.
(Edit: I just verified that the explicit _id values are there as intended as well.)
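A check like that can also be done programmatically once a search response body is in hand. This is a small sketch, assuming the response has already been fetched (for example from a _search request against the index) and parsed into a dict; how it is fetched is left out, and the helper name is made up.

```python
def missing_ids(search_response, expected_ids):
    """Return the expected _id values that do not appear in the search hits.

    search_response is the parsed JSON body of a search request; only the
    hits.hits[]._id values are inspected. Hypothetical helper for illustration.
    """
    found = {hit["_id"] for hit in search_response["hits"]["hits"]}
    return sorted(set(expected_ids) - found)
```

An empty list means every expected ID was found among the returned hits. Keep in mind that a search returns 10 hits by default, so a larger corpus would need an explicit size or a lookup by ID instead.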

dmabuada · December 29, 2020, 8:22pm

Hi @RickBoyd,
I got it all squared away now. I realized that the issue only occurred when downloading the data from S3. It seems that Rally initially copies the data from the remote location into /.rally/benchmarks/data/test-documents/docs.json, but never pulls a fresh copy if the file already exists in that directory. It kept reading from the "cached" file, which had the incorrectly defined actions.

I deleted the file to force a fresh download.
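In case it helps anyone else, the cleanup can be scripted. This is a minimal sketch, assuming the layout I observed (a data directory, then a folder named after the corpus, then the file); the data root is passed in rather than hard-coded because it may differ between setups, and the function name is made up.

```python
from pathlib import Path


def remove_cached_corpus_file(data_root, corpus_name, file_name):
    """Delete a locally cached corpus file so it is downloaded again next time.

    data_root is whatever directory holds the downloaded corpora on your
    machine (an assumption, adjust as needed). Returns True if a file was
    removed and False if there was nothing to remove.
    """
    cached = Path(data_root).expanduser() / corpus_name / file_name
    if cached.is_file():
        cached.unlink()
        return True
    return False
```

For my setup the call would use "test-documents" as the corpus name and "docs.json" as the file name.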

Thanks for all the help!

-Dalia

