Is it possible to extract the document _id with Rally when creating a custom track?

Hey there,
I would like to know whether it is possible to extract the document _id when creating a custom track with Rally.
The goal is to compare write performance when the _id is provided and when it is not.

Thanks

Unfortunately, this is not currently implemented in Rally. We have an open enhancement request for it: "Support includes-action-and-meta-data for track generation" (elastic/rally issue #1134).

Alternatively, you could change the extracted data as the user did here

Also, note that if you don't need your own pre-existing IDs and just want to simulate provided vs auto-generated _id in general, the documentation has this covered for the bulk operation parameters:

conflicts (optional): Type of index conflicts to simulate. If not specified, no conflicts will be simulated (also read below on how to use external index ids with no conflicts). Valid values are: ‘sequential’ (A document id is replaced with a document id with a sequentially increasing id), ‘random’ (A document id is replaced with a document id with a random other id).

conflict-probability (optional, defaults to 25 percent): A number between [0, 100] that defines how many of the documents will get replaced. Combining conflicts=sequential and conflict-probability=0 makes Rally generate index ids by itself, instead of relying on Elasticsearch’s automatic id generation.

So by default we rely on Elasticsearch's automatic id generation, but if you want Rally to provide an explicit _id, configure the bulk operation with conflicts set to sequential and conflict-probability set to 0.
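As a rough sketch of what the two variants could look like, here are two bulk operation definitions built in Python and printed as JSON. The parameter names `conflicts`, `conflict-probability`, `operation-type` and `bulk-size` are Rally's; the operation names and the bulk size of 5000 are placeholders I made up for illustration, so adapt them to your track.

```python
import json

# Bulk operation where Rally provides the _id itself
# (conflicts=sequential combined with conflict-probability=0).
bulk_with_explicit_ids = {
    "name": "bulk-explicit-ids",  # hypothetical operation name
    "operation-type": "bulk",
    "bulk-size": 5000,            # illustrative value only
    "conflicts": "sequential",
    "conflict-probability": 0,
}

# Same operation without the conflict settings, so Elasticsearch
# auto-generates the _id (the default behaviour).
bulk_with_auto_ids = {
    "name": "bulk-auto-ids",      # hypothetical operation name
    "operation-type": "bulk",
    "bulk-size": 5000,            # illustrative value only
}

# Print both definitions as JSON so they can be pasted into a track.
print(json.dumps([bulk_with_explicit_ids, bulk_with_auto_ids], indent=2))
```

Running the same corpus once with each operation against a clean index should give you the provided vs. auto-generated comparison.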

Thank you for the help :)
With conflicts set to sequential, will Rally generate a random first _id (applied to the first document) and then increase it sequentially for the following documents?

The first _id generated is 0, so you will need a clean index when you execute your track. Also note that if you have more than one bulk client, each client gets its own batch of _ids to use in parallel, so the _ids will not arrive in the system in sequential order. If you need to run multiple bulk tasks in a row or execute against an existing index, random may be a better choice.
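To make the multi-client point more concrete, here is a toy model in Python. It is not Rally's actual implementation: the functions `partition_ids` and `arrival_order` are hypothetical, and I am assuming each client gets one contiguous batch and that all clients send at the same pace. It only illustrates why ids that are sequential per client do not arrive sequentially overall.

```python
from itertools import chain, zip_longest


def partition_ids(total_docs, clients):
    """Split ids 0..total_docs-1 into one contiguous batch per client (toy model)."""
    size, extra = divmod(total_docs, clients)
    batches, start = [], 0
    for client in range(clients):
        # The first `extra` clients take one additional id each.
        end = start + size + (1 if client < extra else 0)
        batches.append(list(range(start, end)))
        start = end
    return batches


def arrival_order(batches):
    """Interleave the batches round-robin, as if every client sent at the same pace."""
    interleaved = chain.from_iterable(zip_longest(*batches))
    return [doc_id for doc_id in interleaved if doc_id is not None]


batches = partition_ids(8, 2)
order = arrival_order(batches)
print(batches)  # one contiguous batch of ids per client
print(order)    # ids in the order they would reach the cluster in this model
```

With a single client the arrival order matches the id order; with two or more clients the batches interleave.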

This probably isn't the right thread, but I have a question about random vs. sequential _id.
I assumed that bulk operations with sequential _id would end up a little slower than with random _id. Is that assumption correct?
After a few tests I saw no relevant difference in timing. How is that possible?
Should I fill the cluster with a lot of documents before running Rally, so that ES has to check more documents to guarantee _id integrity?
I am ingesting 1 million documents.
