Following up on the custom parameter source docs and the rally-eventdata-track project:
It seems that if I implement my own custom param source, I basically have to implement reading the files for bulk operations myself, since it replaces the whole file-reading part. Is that right?
I'm only interested in "massaging"/formatting each document that Rally reads from the corpora files.
I want to add an ID to each doc, so I have to use includes-action-and-meta-data, which ignores the target-index. That forces me to generate a custom data file for each operation (I had planned to use the same file for multiple indices).
So my question comes down to this: does Rally support providing just a custom data "formatter" for bulk? If not: I took a look at params.py, where each document is appended to the bulk, and I could massage it right there and inject the index name.
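To make the idea concrete, here is a minimal sketch of the kind of "formatter" I mean. It is plain Python and not part of Rally's API: the function name `with_action_and_meta_data` and the `id_field` parameter are made up for illustration, and I'm assuming each corpus line is a JSON document that already carries a field usable as the ID.

```python
import json
from typing import Iterable, Iterator


def with_action_and_meta_data(
    docs: Iterable[str], index_name: str, id_field: str = "id"
) -> Iterator[str]:
    """Yield an action-and-meta-data line followed by the untouched document.

    Hypothetical helper, not a Rally function. `id_field` is an assumed
    field name inside each document that holds the value to use as _id.
    """
    for line in docs:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the data file
        doc = json.loads(line)
        # "index" with an explicit _id replaces the whole document if it exists
        action = {"index": {"_index": index_name, "_id": str(doc[id_field])}}
        yield json.dumps(action)
        yield line


if __name__ == "__main__":
    sample = ['{"id": 7, "msg": "hello"}', '{"id": 8, "msg": "world"}']
    for out_line in with_action_and_meta_data(sample, "index-1"):
        print(out_line)
```

The point is that the same data file could be streamed once per target index, with only the index name changing, instead of storing one pre-generated file per index.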
Another question: is it possible to tell several corpora to use the same file, or do I have to build a file for each one of them? I tried it and saw that Rally looks for the file in ..../data/{copora_name}/{file_name}, and it's annoying to replicate the same file.
I'm trying to accomplish this without writing a lot of complex code, making only the necessary changes.
*******EDIT
I was able to achieve this by injecting the index name and by removing the logic in loader.py that clears the index name if meta_data is true.
I think that in general you should support generating custom meta_data for bulk operations, because it's widely needed and used, especially when people need the _id property.
Our recommendation in this case is that you pre-generate the data file with the action and meta-data line already included. I also get where you're coming from and that you'd rather reuse the same file and generate the action and meta-data line on the fly.
I just did a quick test based on http_logs (with one corpus) and it works, at least within the same corpus. Note that I used "documents-181998.json.bz2" in both cases.
However, as you've noticed, a corpus is treated as unique, so you cannot reuse the very same file across corpora. Can you please share a few more details about why you need to create multiple corpora but still use the same file?
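As a rough sketch (not the exact track used in that test), a corpus that lists the same source file twice with different target indices could be generated like this. The keys `source-file`, `document-count` and `target-index` are the ones Rally's track format uses for document entries; the helper `corpus_for_indices`, the index names and the document count are placeholders for illustration and must be replaced with real values.

```python
import json

SOURCE_FILE = "documents-181998.json.bz2"


def corpus_for_indices(name, source_file, index_names, document_count):
    """Build one corpus whose entries all point at the same data file.

    Hypothetical helper; document_count must match the real file.
    """
    return {
        "name": name,
        "documents": [
            {
                "source-file": source_file,
                "document-count": document_count,
                "target-index": index_name,
            }
            for index_name in index_names
        ],
    }


# Placeholder index names and document count, for illustration only.
corpus = corpus_for_indices(
    "http_logs", SOURCE_FILE, ["index-1", "index-2"], document_count=1000
)

if __name__ == "__main__":
    print(json.dumps({"corpora": [corpus]}, indent=2))
```

Because both entries live in the same corpus, they resolve to the same file under that corpus's data directory, so nothing has to be copied.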
The idea was to load the same data files in parallel into different indices that have the same mappings.
In that case, I have to copy and paste the same file for each corpus (to simulate heavy load across multiple indices), and I also need specific IDs so that the operation updates existing docs rather than just inserting and indexing (it checks that the doc exists and updates the whole doc).
The problem arises because for each set of mappings (let's say each "type"; I don't use types, it's just an example) I want to load ~10 indices, and I have at least 10 sets. That's 10 x 10 = 100 data files, each 4GB for starters, which is a lot.
This is because a corpus supports only one index.
Another approach is to do what you demonstrated and just repeat the same file over and over for each corpus.
That will save precious storage, but the question is: is there a lag/delay when I reuse the same file between iterations?
I think Rally is already capable of doing what you want.
You can filter the indices that you want to target from a corpus. Within that corpus you can reuse the same data file, as I've already outlined in my previous post. Then you'd specify two bulk operations in a parallel element. You can define which indices from the corpus definition you want to target with each bulk operation (see the indices property in the bulk operation docs). Suppose you have defined index-1, index-2, index-3 and index-4 in your corpus. Then you can limit a bulk-indexing operation to only index-1 and index-3 by listing just those two in its indices property.
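A minimal sketch of such a schedule, built as a Python dict and dumped as JSON, might look like the following. The `parallel` element, the `bulk` operation type and its `bulk-size` and `indices` properties come from Rally's track format; the helper `bulk_task`, the task names, the bulk size and the client counts are assumptions chosen only for illustration.

```python
import json


def bulk_task(name, indices, bulk_size=5000, clients=2):
    """Build one bulk task restricted to the given indices.

    Hypothetical helper; bulk_size and clients are placeholder values.
    """
    return {
        "name": name,
        "operation": {
            "operation-type": "bulk",
            "bulk-size": bulk_size,
            # only documents targeting these indices are sent by this task
            "indices": indices,
        },
        "clients": clients,
    }


# Two bulk tasks running concurrently, each limited to a subset of the
# four indices defined in the corpus.
schedule = [
    {
        "parallel": {
            "tasks": [
                bulk_task("bulk-index-1-3", ["index-1", "index-3"]),
                bulk_task("bulk-index-2-4", ["index-2", "index-4"]),
            ]
        }
    }
]

if __name__ == "__main__":
    print(json.dumps({"schedule": schedule}, indent=2))
```

Each task gets its own name so the two bulk operations can be told apart in the results.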