Hey there,
I would like to know whether it is possible to extract the document _id when creating a custom track with Rally.
The goal is to compare write performance when the _id is provided and when it is not.
Thanks
Unfortunately, this is not currently implemented in Rally. We have an open enhancement request here: Support includes-action-and-meta-data for track generation · Issue #1134 · elastic/rally · GitHub
Alternatively, you could change the extracted data as the user did here
Also, note that if you don't need your own pre-existing IDs and just want to simulate provided vs. auto-generated _id in general, the documentation covers this in the bulk operation parameters:
conflicts (optional): Type of index conflicts to simulate. If not specified, no conflicts will be simulated (also read below on how to use external index ids with no conflicts). Valid values are: ‘sequential’ (A document id is replaced with a document id with a sequentially increasing id), ‘random’ (A document id is replaced with a document id with a random other id).
conflict-probability (optional, defaults to 25 percent): A number between [0, 100] that defines how many of the documents will get replaced. Combining conflicts=sequential and conflict-probability=0 makes Rally generate index ids by itself, instead of relying on Elasticsearch’s automatic id generation.
So, by default we use auto-generation, but if you want Rally to provide an explicit _id, you would configure the operation with conflicts set to sequential and conflict-probability set to 0.
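As a rough sketch of what that comparison could look like, the snippet below builds two bulk operation definitions and prints them as JSON for pasting into a track's operations list. The conflicts and conflict-probability parameters are the ones quoted from the documentation above; the operation names and the bulk-size value are assumptions chosen for illustration only.

```python
import json

# Baseline: no "conflicts" setting, so Elasticsearch auto-generates each _id.
# The name and bulk-size are illustrative assumptions, not recommended values.
auto_id_op = {
    "name": "index-auto-id",
    "operation-type": "bulk",
    "bulk-size": 5000,
}

# Variant: conflicts=sequential with conflict-probability=0 makes Rally
# generate the ids itself without simulating any conflicts.
explicit_id_op = {
    **auto_id_op,
    "name": "index-explicit-id",
    "conflicts": "sequential",
    "conflict-probability": 0,
}

if __name__ == "__main__":
    print(json.dumps([auto_id_op, explicit_id_op], indent=2))
```

Running each operation against a freshly created index keeps the two measurements comparable.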
Thank you for the help
With conflicts set to sequential, will Rally generate a random first _id (applied to the first doc) and then increase it sequentially for the following documents?
The first _id generated is 0, so you will need a clean index in your track execution. Also note that if you have more than one bulk client, each client will get a batch of _ids to use in parallel, so the arrival of the _ids in the system itself will not be sequential. If you need to run multiple bulk tasks in a row or execute against an existing index, random may be a better choice.
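To picture why the arrival order is not sequential with several clients, here is a toy simulation. It is not Rally's actual implementation: the way ids are split into contiguous ranges and the lockstep sending are both simplifying assumptions, and the function names are hypothetical.

```python
from itertools import chain, zip_longest


def assign_id_ranges(total_docs, clients):
    """Give each client one contiguous range of ids (assumed scheme)."""
    base, extra = divmod(total_docs, clients)
    ranges, start = [], 0
    for client in range(clients):
        size = base + (1 if client < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def interleaved_arrival(ranges):
    """Idealized arrival order if every client sends one doc per turn."""
    missing = object()  # marks exhausted clients when ranges differ in length
    turns = zip_longest(*ranges, fillvalue=missing)
    return [doc_id for doc_id in chain.from_iterable(turns) if doc_id is not missing]


if __name__ == "__main__":
    # Two clients, eight documents: ids arrive interleaved, not in 0..7 order.
    print(interleaved_arrival(assign_id_ranges(8, 2)))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Each client's own ids still increase monotonically; only the combined stream seen by the cluster is out of order.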
This probably isn't the right thread, but I have a question about random vs. sequential _id.
I assumed that bulk operations with sequential _id would end up a little slower than with random _id. Is that assumption correct?
After a few tests I saw no relevant difference in timing. How is that possible?
Should I fill the cluster with a lot of docs before using Rally, so that ES has to check more docs to guarantee _id integrity?
I am ingesting 1 million docs.