My data model looks something like this: I have a product document and a list of user activities on our website. Each user activity is expressed like this: user [searched] for the product at [07-18-2017], user [bought] the product at [07-16-2017]. Each product has between 1 and 5,000 activities, and I have about 20,000 products.
I want to get the top 10 documents with the highest searched vs. bought percentage. That means the query first picks the relevant product documents, then sorts them by calculating, for EACH document, the number of bought activities and the number of searched activities, and then dividing one by the other. In addition, the calculation has to select the right activities for each product (for example, sometimes I want to calculate over only the activities made in a specific date range).
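To make the intended ranking concrete, here is a minimal Python sketch of the calculation described above, run over in-memory data rather than Elasticsearch. The sample data, the function names, and the choice to skip products with no bought activities in the range (to avoid dividing by zero) are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical sample data: product id -> list of (action, day) activities.
PRODUCTS = {
    "p1": [("searched", date(2017, 7, 18)),
           ("searched", date(2017, 7, 17)),
           ("bought", date(2017, 7, 16))],
    "p2": [("searched", date(2017, 7, 10)),
           ("bought", date(2017, 7, 11)),
           ("bought", date(2017, 7, 12))],
    "p3": [("searched", date(2017, 7, 5))],
}


def searched_to_bought_ratio(activities, start, end):
    """Return searched/bought for activities within [start, end], or None."""
    searched = 0
    bought = 0
    for action, day in activities:
        if not (start <= day <= end):
            continue  # activity falls outside the requested date range
        if action == "searched":
            searched += 1
        elif action == "bought":
            bought += 1
    if bought == 0:
        return None  # assumption: products without purchases are not ranked
    return searched / bought


def top_products(products, start, end, n=10):
    """Rank products by their searched/bought ratio, highest first."""
    scored = []
    for product_id, activities in products.items():
        ratio = searched_to_bought_ratio(activities, start, end)
        if ratio is not None:
            scored.append((product_id, ratio))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


ranking = top_products(PRODUCTS, date(2017, 7, 1), date(2017, 7, 31))
```

Whatever approach is chosen on the Elasticsearch side has to reproduce this per-document filter, count, and divide step before sorting.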
I think the best way to do what I need is with a script.
I see several different approaches for using one:
- Modeling the products as root documents, and the activities as nested documents. The problem: the script has to access the `_source` field every time, which means a parsing and anti-cache penalty.
- Same as the first approach, but using a parent-child index. This might be more efficient, since each parent has many children, though they are immutable. But is it even possible to write a script over the children?
- Same as the first approach, but using the Bucket Script Aggregation as a sub-aggregation under a Nested Aggregation. This would let me avoid accessing the `_source` field in the script itself and use the aggregation cache. By the way, @colings86, there seems to be an open issue about ordering by a Bucket Script Aggregation? But there is also some kind of workaround here?
- Modeling both the product and its activities in the same document. How? By indexing each activity as a single string instead of a separate nested document. For example, instead of indexing an activity as an object:
"Action" : "Search",
"Timestamp" : "07-18-2017 09:21:56"
Indexing an activity as a string: "Search#07-18-2017 09:21:56". This would let me store the activities in an array (keyword data type), which is stored as `doc_values`; according to the documentation, this is the fastest way. The script iterates through all the items in the array and splits each item on '#' into 2 separate values. I know it looks a bit strange, but it should work. That's the best solution, in my opinion.
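The string-encoding idea in the last approach can be sketched in plain Python as follows. This only illustrates the encode, split, and count logic; it is not the actual Elasticsearch script, and the function names and the timestamp format constant are assumptions based on the example string above.

```python
from datetime import datetime

# Assumed format, inferred from the example "Search#07-18-2017 09:21:56".
TIMESTAMP_FORMAT = "%m-%d-%Y %H:%M:%S"


def encode_activity(action, timestamp):
    """Pack an activity into a single 'Action#timestamp' string."""
    return f"{action}#{timestamp.strftime(TIMESTAMP_FORMAT)}"


def decode_activity(encoded):
    """Split an encoded activity on the first '#' into (action, timestamp)."""
    action, raw_timestamp = encoded.split("#", 1)
    return action, datetime.strptime(raw_timestamp, TIMESTAMP_FORMAT)


def count_actions(encoded_activities, start, end):
    """Count activities per action whose timestamp lies within [start, end]."""
    counts = {}
    for item in encoded_activities:
        action, timestamp = decode_activity(item)
        if start <= timestamp <= end:
            counts[action] = counts.get(action, 0) + 1
    return counts


activities = [
    encode_activity("Search", datetime(2017, 7, 18, 9, 21, 56)),
    encode_activity("Search", datetime(2017, 7, 17, 14, 0, 0)),
    encode_activity("Buy", datetime(2017, 7, 16, 8, 30, 0)),
]
counts = count_actions(
    activities, datetime(2017, 7, 16), datetime(2017, 7, 18, 23, 59, 59)
)
```

One thing worth checking before committing to this design: doc values for a keyword field are stored sorted and de-duplicated, so a script reading them would not see the original array order, and two identical strings (same action in the same second) would appear only once.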
What do you think about using a script for this task?
How would you expect it to perform?