Is it possible to extract the document _id with Rally and a custom track?

rschirin · November 16, 2022, 11:39am

Hey there,
I would like to know whether it is possible to extract the document _id when creating a custom track with Rally.
The goal is to compare write performance when the _id is provided and when it is not.

Thanks

RickBoyd · November 16, 2022, 1:58pm

Unfortunately, this is not currently implemented in Rally. We have an open enhancement request for it: "Support includes-action-and-meta-data for track generation" (elastic/rally issue #1134 on GitHub).

Alternatively, you could change the extracted data as the user did here
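For readers who want to try the "change the extracted data" route, here is a minimal sketch of the idea, not the linked user's actual script. It assumes you have already exported your documents by some other means into a file with one JSON object per line, each carrying hypothetical `_id` and `_source` keys (Rally's track generation does not produce this). It rewrites that export into Elasticsearch bulk format, i.e. an action-and-meta-data line followed by the document source, which is what a corpus marked with `includes-action-and-meta-data` is expected to contain. The function name and the sample records are made up for illustration.

```python
import json


def to_bulk_lines(exported_lines, id_field="_id", source_field="_source"):
    """Yield an action line followed by a source line for every exported hit.

    The field names are assumptions about the export format, not something
    Rally writes for you.
    """
    for line in exported_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        hit = json.loads(line)
        # Action-and-meta-data line carrying the pre-existing document id.
        yield json.dumps({"index": {"_id": hit[id_field]}})
        # The document body itself, on the following line.
        yield json.dumps(hit[source_field])


# Hypothetical sample export, two documents.
exported = [
    '{"_id": "a1", "_source": {"message": "hello"}}',
    '{"_id": "b2", "_source": {"message": "world"}}',
]
bulk_lines = list(to_bulk_lines(exported))
print("\n".join(bulk_lines))
```

The resulting lines could be written to the corpus documents file; the corpus definition in the track would then presumably need `includes-action-and-meta-data` set to true so Rally does not add its own action lines.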

RickBoyd · November 16, 2022, 2:04pm

Also, note that if you don't need your own pre-existing IDs and just want to simulate provided vs auto-generated _id in general, the documentation has this covered for the bulk operation parameters:

conflicts (optional): Type of index conflicts to simulate. If not specified, no conflicts will be simulated (also read below on how to use external index ids with no conflicts). Valid values are: ‘sequential’ (A document id is replaced with a document id with a sequentially increasing id), ‘random’ (A document id is replaced with a document id with a random other id).

conflict-probability (optional, defaults to 25 percent): A number between [0, 100] that defines how many of the documents will get replaced. Combining conflicts=sequential and conflict-probability=0 makes Rally generate index ids by itself, instead of relying on Elasticsearch’s automatic id generation.

So, by default Rally relies on Elasticsearch's automatic _id generation, but if you want Rally to provide an explicit _id, you would configure the bulk operation with conflicts set to sequential and conflict-probability set to 0.
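As a rough illustration of the two variants, the sketch below builds both bulk operation definitions as Python dicts and prints them as JSON that could be pasted into a track's operations list. The parameter names `conflicts` and `conflict-probability` come from the documentation quoted above; the operation names, the helper function, and the bulk size of 5000 are placeholders chosen for this example.

```python
import json


def bulk_operation(name, bulk_size=5000, rally_generated_ids=False):
    """Build a Rally bulk operation definition as a plain dict.

    The name and bulk size are illustrative placeholders.
    """
    operation = {
        "name": name,
        "operation-type": "bulk",
        "bulk-size": bulk_size,
    }
    if rally_generated_ids:
        # sequential + probability 0: Rally supplies the _id itself,
        # without simulating any conflicts.
        operation["conflicts"] = "sequential"
        operation["conflict-probability"] = 0
    return operation


auto_ids = bulk_operation("bulk-auto-id")
explicit_ids = bulk_operation("bulk-explicit-id", rally_generated_ids=True)

print(json.dumps([auto_ids, explicit_ids], indent=2))
```

Running the same corpus once with each operation against a fresh index gives the provided vs auto-generated _id comparison asked about in the original question.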

rschirin · November 16, 2022, 2:14pm

Thank you for the help.
With conflicts set to sequential, will Rally generate a random first _id (applied to the first doc) and then increase it sequentially for the following documents?

RickBoyd · November 16, 2022, 2:56pm

The first _id generated is 0, so you will need a clean index in your track execution. Also note that if you have more than one bulk client, each client will get its own batch of _ids to use in parallel, so the order in which the _ids arrive in the system will not be sequential. If you need to run multiple bulk tasks in a row or execute against an existing index, random may be a better choice.
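To make the multi-client point concrete, here is a toy model, not Rally's actual implementation. It assumes the ids starting at 0 are split into one contiguous range per client and that the clients take turns strictly round-robin; real clients run concurrently, so the true interleaving is less regular. Both function names are hypothetical.

```python
from itertools import zip_longest


def id_ranges(total_docs, clients):
    """Split ids 0..total_docs-1 into one contiguous range per client."""
    per_client, remainder = divmod(total_docs, clients)
    ranges, start = [], 0
    for client in range(clients):
        # The first `remainder` clients take one extra document each.
        size = per_client + (1 if client < remainder else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def arrival_order(ranges):
    """Interleave the clients round-robin to mimic parallel indexing."""
    interleaved = zip_longest(*ranges)
    return [doc_id for batch in interleaved for doc_id in batch if doc_id is not None]


ranges = id_ranges(total_docs=8, clients=2)
print(arrival_order(ranges))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Each client's own ids are increasing, but the stream Elasticsearch sees jumps between the ranges, which is why "sequential" with several clients does not mean globally ordered arrival.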

rschirin · November 17, 2022, 6:41pm

This probably isn't the right thread, but I have a question about random vs sequential _id.
I assumed that bulk operations with sequential _id would end up a little slower than with random _id. Is this assumption correct?
After a few tests I saw no relevant time difference. How is that possible?
Should I perhaps fill the cluster with a lot of docs before using Rally, so that ES has to check more docs to guarantee _id integrity?
I am ingesting 1 million docs.

system · December 15, 2022, 6:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
