It's my first Elasticsearch project and I could use some input.
Situation: Ecommerce scenario with 6000 products and 6000 customers. Each customer can have a specific discount for a product and every customer belongs to a single price group. Right now, the (full) prices for all price groups are nested in the product documents which enables us to get a price filter with the full price of the product.
Objective: To create a range aggregation based on the customer's discounted prices when they log in, instead of the full prices as it is now.
Challenge: We are trying to get price aggregations based on a specific customer's prices.
Solution 1: Flatten the documents and nest all customer prices in the products. I'm concerned about the performance on searches and rendering result sets after nesting 6000 docs in a product document.
Solution 2: Put prices in a different index and do joins to retrieve the right ones - an even bigger performance hit than Solution 1.
What do you think? Is there a better approach to it?
I'm wondering if you can do something sneaky here.
Maybe if you had a single price field here which took an array of values - one value for each customer.
The trick is to first multiply the prices by the customer number to project them into a reserved "number space". As an example:
float maxPriceForAProduct = 1000;
To do a range query for products between $10 and $20 for customer 7 you'd do this:
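The example code appears to have been lost from this copy of the post, so what follows is only a minimal sketch of one way to read the idea, not the poster's original code. It assumes each customer's slot starts at customerNumber * maxPriceForAProduct, that the price is added as an offset inside that slot, and that every price is strictly below maxPriceForAProduct. The class name PriceSpace, the methods encode and rangeQuery, and the field name customer_prices are all hypothetical. It uses double instead of float because a float cannot hold cents exactly once values reach the millions.

```java
import java.util.Locale;

class PriceSpace {
    // Assumption: no product costs this much or more for any customer.
    static final double MAX_PRICE_FOR_A_PRODUCT = 1000;

    // Project a price into the slot reserved for one customer.
    static double encode(int customerNumber, double price) {
        if (price < 0 || price >= MAX_PRICE_FOR_A_PRODUCT) {
            throw new IllegalArgumentException("price outside [0, max): " + price);
        }
        return customerNumber * MAX_PRICE_FOR_A_PRODUCT + price;
    }

    // Build a range query body on the hypothetical field "customer_prices".
    static String rangeQuery(int customerNumber, double from, double to) {
        return String.format(Locale.ROOT,
            "{\"query\":{\"range\":{\"customer_prices\":{\"gte\":%.2f,\"lte\":%.2f}}}}",
            encode(customerNumber, from), encode(customerNumber, to));
    }

    public static void main(String[] args) {
        // Products between $10 and $20 for customer 7.
        System.out.println(rangeQuery(7, 10, 20));
    }
}
```

Under these assumptions, customer 7's prices between $10 and $20 map to the bounds 7010 and 7020, which no other customer's encoded prices can fall into.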
It is a bit of a hacky solution. Instead of this, wouldn't it be easier to store the individual discounts as a dictionary and do the calculation on the fly in Elasticsearch?
I'm not sure exactly what you have in mind but I imagine it would require a script running across all docs which might be slow.
With the approach I outlined you'd be using the same sort of data structures that make GEO queries fast - instead of querying ranges of lat/lon numbers you'd be querying ranges of prices where each customer had their own space carved out.
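Since the original goal was a range aggregation, here is a hedged sketch of how price buckets might be shifted into one customer's carved-out space. It assumes the same bound of 1000 per customer as the example earlier in the thread and whole-number bucket boundaries. The class PriceBuckets, the method rangeAggregation, the aggregation name price_ranges and the field customer_prices are hypothetical names. In an Elasticsearch range aggregation "from" is inclusive and "to" is exclusive, so the last bucket stops just short of the next customer's slot.

```java
class PriceBuckets {
    // Assumed upper bound on any price, as in the example earlier in the thread.
    static final long MAX_PRICE_FOR_A_PRODUCT = 1000;

    // Shift plain price boundaries (e.g. 0, 10, 20, 1000) into one customer's slot
    // and emit a range aggregation body on the hypothetical field "customer_prices".
    static String rangeAggregation(int customerNumber, long[] boundaries) {
        long base = customerNumber * MAX_PRICE_FOR_A_PRODUCT;
        StringBuilder ranges = new StringBuilder();
        for (int i = 0; i + 1 < boundaries.length; i++) {
            if (i > 0) {
                ranges.append(',');
            }
            ranges.append("{\"from\":").append(base + boundaries[i])
                  .append(",\"to\":").append(base + boundaries[i + 1])
                  .append('}');
        }
        return "{\"aggs\":{\"price_ranges\":{\"range\":{\"field\":\"customer_prices\",\"ranges\":["
            + ranges + "]}}}}";
    }

    public static void main(String[] args) {
        // Buckets $0-10, $10-20 and $20-1000 for customer 7.
        System.out.println(rangeAggregation(7, new long[] {0, 10, 20, 1000}));
    }
}
```

Values belonging to other customers fall outside every bucket, so they are not counted even though they live in the same multi-valued field.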
Hm... Wouldn't it be slower to index ranges, since in most of the queries I will already have the customer number? Wouldn't it be slightly more performant to have it the other way around?
Each customer has a percentage discount on the base price of a product. But yes, it would require a script and performance is a big concern in this case.
I'm only wondering one thing. Would there be a difference between having this array and having a nested dictionary, for example?
Just checking assumptions here - I assume this is the worst-case scenario where a customer can have a per-product special discount rather than just a single percentage discount across all products.
The approach I describe relies on using the search index to narrow down to clusters in a numeric space which contain all of the document IDs in that range. This cost is sub-linear in the number of docs.
Any approach that doesn't use the index and relies on scripts scanning document properties has a cost that is linear in the number of docs.
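To make the linear versus sub-linear distinction concrete, here is a toy sketch. A sorted array with binary search stands in for the real numeric index structure, which is more sophisticated, and a plain loop stands in for a script that inspects every document. The class IndexVsScan and its methods are hypothetical names used only for illustration.

```java
class IndexVsScan {
    // Linear: visit every value and test it, as a per-document script would.
    static int countByScan(double[] values, double from, double to) {
        int hits = 0;
        for (double v : values) {
            if (v >= from && v <= to) {
                hits++;
            }
        }
        return hits;
    }

    // Sub-linear: two binary searches over values that were sorted once up front.
    static int countBySortedIndex(double[] sorted, double from, double to) {
        // Math.nextUp(to) turns the inclusive upper bound into "first value > to".
        return lowerBound(sorted, Math.nextUp(to)) - lowerBound(sorted, from);
    }

    // Index of the first element >= key (sorted.length if there is none).
    static int lowerBound(double[] sorted, double key) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] sorted = {6999.5, 7005, 7015, 7020, 8012};
        System.out.println(countByScan(sorted, 7010, 7020));
        System.out.println(countBySortedIndex(sorted, 7010, 7020));
    }
}
```

Both methods return the same count; the difference is that the scan touches every value while the sorted lookup touches only a logarithmic number of them.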
No, it wouldn't. The customer number filter will lead the enumeration and will be efficiently intersected with any product filter, if one is requested.
An inverted index might seem odd if you come from another background. When we design an index, we just need to make sure that the query enumerates only the search results (that's unavoidable) and nothing that doesn't end up in the results, i.e. no full scans.
This layout relies on a smaller number of ranges, and it is vulnerable to, for example, an excessive, dynamically generated grid of overlapping ranges.
Another advantage of this layout is that it extends the existing index and can be disabled for querying via a kill switch.
And of course, trust no one, prototype it yourself.