Merging two indexes by a common field


(marwa) #1

Hi,

I have an index named 'transactions' and another one called 'costs'.

The transactions index contains documents like:
{ product_id: "1111", price_unit: "23.56", customer_name:"Marda Elbin" }

The costs index contains the id of each product along with its costs.

I want to merge both indexes based on product_id, i.e. to make a join between them on that common field.

My goal is to use the scripting language to calculate revenues - costs for each product, so I used aliases :slight_smile:

POST _aliases
{
  "actions" : [
    { "add" : { "index" : "costs", "alias" : "alias1" } },
    { "add" : { "index" : "transactions", "alias" : "alias1" } }
  ]
}
Then I tried to calculate revenues - costs with this query:

GET alias1/_search
{
  "query" : {
    "match_all": {}
  },
  "script_fields" : {
    "test1" : {
      "script" : {
        "lang": "painless",
        "inline": "int gross_profit_margin = 0; if (doc[product_identifier].value == doc['product_id'].value){ gross_profit_margin + = (doc['product_price'].value * doc['quantity'].value * (1 - doc['discount'].value /100)) - (doc['average_marketing_cost'].value + doc['average_promotional_cost'].value + doc['direct_and_indirect_cost'].value ) return gross_profit_margin }"
      }
    }
  }
}

This error appears:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [
          "... ){ gross_profit_margin + = (doc['product_price'].v ...",
          "                              ^---- HERE"
        ],
        "script": "int gross_profit_margin = 0; if (doc[product_identifier].value == doc['product_id'].value){ gross_profit_margin + = (doc['product_price'].value * doc['quantity'].value * (1 - doc['discount'].value /100)) - (doc['average_marketing_cost'].value + doc['average_promotional_cost'].value + doc['direct_and_indirect_cost'].value ) return gross_profit_margin }",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,

So my questions are:

  1. How can I merge indexes by a common field?
  2. Do aliases allow me to do so?

Thank you!


(Mark Harwood) #2

The context of your script's execution is accessing the properties of a single doc from a single index. Your script would not be able to simultaneously see doc 1 from the transactions index and doc 2 from the costs index.

You can use the scroll API to read across the two indices (no alias required), sorting on a common key; your application code would then have to process the stream of interleaved results sorted by product ID. It would buffer a product's costs, applying them to the next transactions that share the common key. The newly enriched transactions could then be inserted into a new index using the bulk API.
Of course this only works if a transaction is for exactly one product. If your "transaction" docs are more like orders with >1 product type, then this approach clearly won't work, as individual orders can't be sorted sensibly by a single key. In that scenario you'd have to do point queries on your product store to retrieve costs (a cache would obviously help here).
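A minimal sketch of that merge step in plain Python, assuming both streams are already sorted by product_id. In practice each stream would come from the scroll API rather than an in-memory list, and the field names (product_id, cost, price_unit) are illustrative, not taken from a real mapping:

```python
# Merge-join two streams that are both sorted by product_id.
# In a real setup each stream would be fed by the Elasticsearch
# scroll API; here they are plain lists for illustration.

def merge_by_product_id(costs, transactions):
    """Attach each product's cost to the transactions sharing its id."""
    costs_iter = iter(costs)
    current_cost = next(costs_iter, None)
    enriched = []
    for tx in transactions:  # transactions sorted by product_id
        # Advance the costs stream until it catches up with this transaction.
        while current_cost is not None and current_cost["product_id"] < tx["product_id"]:
            current_cost = next(costs_iter, None)
        doc = dict(tx)
        if current_cost is not None and current_cost["product_id"] == tx["product_id"]:
            doc["cost"] = current_cost["cost"]
        enriched.append(doc)
    return enriched

costs = [{"product_id": "1111", "cost": 10.0},
         {"product_id": "2222", "cost": 4.5}]
transactions = [{"product_id": "1111", "price_unit": 23.56},
                {"product_id": "2222", "price_unit": 8.0}]

merged = merge_by_product_id(costs, transactions)
```

Each enriched document could then be written to a new index with the bulk API.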


(marwa) #3

I didn't get what you really mean. Can you explain it to me in more detail, please?

thank you


(Mark Harwood) #4

Perhaps it makes sense to clarify the requirement before we clarify the solution:

  • Do any of your "transaction" docs contain >1 product?
  • Are you after a single number for margin across all products or per-product margins?
  • Are you doing this calculation regularly?
  • Should profit margins be measured as product-cost-at-time-of-sale or based on currently recorded product cost?
  • How many products and transactions do you have?

(marwa) #5
  • For each transaction I have only one product.
  • It would be better to calculate the margin per transaction rather than per product, because each transaction has one product sold in a specific quantity.
  • Yes, I will do it regularly.
  • The profit margin will be calculated with the currently recorded product cost, which is stored in the costs index (I didn't record the time of the cost of a product; in the costs index I have only the product id and its cost, nothing more).
  • I have 1055 products and more than 30000 transactions.

thank you


(Mark Harwood) #6

Assuming you want to store costs/profit, the best suggestion would be to fix your data "on the way in" if possible. 1,000 products is not a lot of data to keep in a RAM cache and look up as you insert transaction data.
Logstash, I believe, has some "lookup" type features that could help with this (best to ask in that forum).

The advantages of doing it this way rather than a batch fix-it-later scheme are:

  1. costs are recorded with the values current at point-of-sale
  2. there is no lag between the logging of transactions and the costing of transactions.

(marwa) #7

So you are suggesting that I remodel my data and store both indexes in one index?


(Mark Harwood) #8

Denormalization for the win. https://www.elastic.co/guide/en/elasticsearch/guide/current/denormalization.html


(marwa) #9

So I have to denormalize all the data I have manually? I am sorry, but I couldn't figure out what I am supposed to do in my case, because denormalizing the data means that for each transaction I would have to look up the product id with a query?


(Mark Harwood) #10

This is a pretty common sort of enrichment task that any number of ETL tools support. Personally, for a problem this small, I'd opt for some custom Python and use a dict as a cache for product info. Like I said, Logstash has some of this data-enrichment logic, but it looks like a smarter solution with caching is still some way off.
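A minimal sketch of that dict-as-cache enrichment, under assumed field names (product_id, cost, price_unit, quantity) that stand in for the real mapping. In practice the cache would be loaded once from the costs index and the enriched docs written back via the bulk API:

```python
# Enrich transactions "on the way in": keep all product costs in a
# dict (1,055 products easily fit in RAM) and attach the cost plus a
# computed gross profit to each incoming transaction.

def build_cost_cache(cost_docs):
    """Map product_id -> cost for O(1) lookups during ingestion."""
    return {doc["product_id"]: doc["cost"] for doc in cost_docs}

def enrich(transaction, cost_cache):
    """Return a copy of the transaction with cost and gross profit added."""
    doc = dict(transaction)
    cost = cost_cache.get(doc["product_id"])
    if cost is not None:
        doc["cost"] = cost
        # Simplified margin: revenue minus cost, both scaled by quantity.
        doc["gross_profit"] = (doc["price_unit"] - cost) * doc["quantity"]
    return doc

cache = build_cost_cache([{"product_id": "1111", "cost": 10.0}])
tx = {"product_id": "1111", "price_unit": 23.56, "quantity": 2}
enriched = enrich(tx, cache)
```

Because every document then carries its own cost and profit, no cross-index scripting is needed at query time.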

Either way, here is probably not the forum to discuss approaches further as this is somewhat "upstream" of core elasticsearch.


(marwa) #11

Just a final question: can I store both data sets in one index but as different types (is that possible when the two data sets have different mappings)? (Note that I would have to rename product_id in one of the data sets.) Then use Painless to test whether documents have the same product identifier, and if the condition holds, calculate the profit margin?


(Mark Harwood) #12

No, because of my first comment: a script only ever sees the properties of a single doc from a single index.


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.