The context of your script's execution is accessing the properties of a single doc from a single index. Your script would not be able to simultaneously see doc1 from the transactions index and doc2 from the products index.
You can use the scroll API to read across the two indices (no alias required), sorting on a common key, and your application code would then process the stream of interleaved results sorted on product ID. It would buffer a product's cost and apply it to the following transactions that share the common key. The newly enriched transactions could then be inserted into a new index using the bulk API.
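Roughly, in Python with the official client, that merge could look like the sketch below. It is only a sketch: it assumes both indices expose a sortable (keyword) product_id field, that each product has a single cost doc with a cost field, and that the enriched copies go into a new transactions_enriched index; all of those names are illustrative, not taken from your mappings.

```python
# Sketch only: one sorted scroll over both indices, merged in application code,
# then bulk-inserted into a new index. Index and field names are assumptions.
from itertools import groupby

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def sorted_stream():
    # One scrolled stream over both indices, ordered by the common key.
    # preserve_order=True keeps the product_id sort instead of _doc order.
    return helpers.scan(
        es,
        index="transactions,costs",
        query={"sort": ["product_id"], "query": {"match_all": {}}},
        preserve_order=True,
    )

def enriched_transactions():
    # Group the interleaved stream by product_id, pick out the cost doc,
    # then apply its cost to every transaction in the same group.
    for product_id, hits in groupby(sorted_stream(),
                                    key=lambda h: h["_source"]["product_id"]):
        hits = list(hits)
        cost_docs = [h for h in hits if h["_index"] == "costs"]
        unit_cost = cost_docs[0]["_source"]["cost"] if cost_docs else None
        for hit in hits:
            if hit["_index"] != "transactions":
                continue
            src = dict(hit["_source"])
            if unit_cost is not None:
                src["cost"] = unit_cost  # margin could also be computed here
            yield {"_index": "transactions_enriched", "_source": src}

# Bulk-insert the enriched docs into the new index.
helpers.bulk(es, enriched_transactions())
```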
Of course, this only works if a transaction is for exactly one product. If your "transaction" docs are more like orders with more than one product type, then this approach clearly won't work, as individual orders can't be sorted sensibly by a single key. In that scenario you'd have to do point queries on your product store to retrieve costs (a cache would obviously help here).
It would be better to calculate the margin per transaction, not per product, because each transaction records a product sold with a specific quantity.
Yes, I will do it regularly.
The profit margin will be calculated with the currently recorded product cost, which is stored in the costs index (I didn't specify the time of the cost of a product; in the costs index I have only the product ID and its cost, nothing more).
I have 1,055 products and more than 30,000 transactions.
Assuming you want to store costs/profit, the best suggestion would be to fix your data "on the way in" if possible. 1,000 products is not a lot of data to keep in a RAM cache and look up as you insert transaction data.
Logstash, I believe, has some "lookup"-type features that could help with this (best to ask in that forum).
The advantages of doing it this way, rather than a batch fix-it-later scheme, are:
- costs would be recorded with the values current at point-of-sale
- there is no lag between the logging of transactions and the costing of transactions
So I have to denormalize all the data I have manually? I am sorry, but I couldn't figure out what I am supposed to do in my case, because denormalizing the data means that for each transaction I would have to look up the product ID with a query?
This is a pretty common sort of enrichment task that any number of ETL tools support. Personally, for a problem this small, I'd opt for some custom Python and use a dict as a cache for product info. Like I said, Logstash has some of this data enrichment logic, but it looks like a smarter solution with caching is still some way off.
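Something like the rough Python sketch below is what I have in mind. The field names (product_id, cost, price, quantity), the margin formula, and the index names are just guesses based on this thread, so adjust them to your actual mappings.

```python
# Sketch only: enrich transactions "on the way in" with a dict cache of costs.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# ~1,000 products easily fits in memory: build the cache once up front.
cost_cache = {
    hit["_source"]["product_id"]: hit["_source"]["cost"]
    for hit in helpers.scan(es, index="costs")
}

def enrich(transaction):
    """Attach the current cost (and a margin, assumed to be per-quantity)."""
    cost = cost_cache.get(transaction["product_id"])
    if cost is not None:
        transaction["cost"] = cost
        transaction["margin"] = (transaction["price"] - cost) * transaction["quantity"]
    return transaction

def index_transactions(raw_transactions):
    # Enrich each transaction before it is indexed, so costs reflect the
    # values current at point-of-sale and no batch fix-up is needed later.
    actions = (
        {"_index": "transactions", "_source": enrich(t)}
        for t in raw_transactions
    )
    helpers.bulk(es, actions)
```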
Either way, this is probably not the forum to discuss approaches further, as this is somewhat "upstream" of core Elasticsearch.
Just a final question: can I store both data sets in one index but with different types (is it possible when the two data sets have different mappings)? (Noting that I would have to rename product_id in one of the data sets.) Then use Painless to test whether documents have the same product identifier, and if the condition is verified, calculate the profit margin?