Hello, I have a question, on how to aggregate data for ElasticSearch and display it in Kibana correctly.
Here is my use case:
I have following structure of raw SQL table Schema:
UserId, UserGroup, Platform, ProductAsin, Category, SubCategory, Units, Price
UserId - user, who bought product(ProductAsin)
UserGroup - user can be from group A or group B
Platform - Mobile or Desktop
ProductAsin - asin of product, that this userId has bought
Category - category of product (Books, Cosmetics etc, we have 100 categories)
SubCategory - sub category of product (~2000 subcategories total)
Units - how many of ProductAsin user has bought
Price - price of ONE unit
I have more than 5 millions records per day.
I need to calculate different metrics. Right now, to make it simple, lets discuss just one of them, the most important called Average Order Product Sales ( avg ops). ops is calculate like number of units user bought multiple by price of one unit. So basically, it’s how much user has spent.
What I need to do is to calculate this metric (avg ops) and display it in Kibana, so users of Kibana can visualize this metric (avg ops) in different dimensions.
For example, what is avg ops in Books Category?
Or what’s the difference between avg ops in User Group A who used Mobile Platform and User Group A who used Desktop platform.
Or what is avg ops in Mobile Platform and Kids Books category for UserGroup B.
So basically, I need to have this one metrics (avg ops) to be available for all possible dimensions.
As far as I have large dataset (5 mln records per day), I think, I cannot upload all these records into ES everyday.
What I decided to do is to process input dataset in Java and after that, upload to ES.
How I processed it: I performed group by by all dimensions that I care about (UserGroup, Platform, Category, SubCategory).
So I got multiple(4) fields group by clause. And calculated total ops for each of this group. I got 5000 aggregated records out of 5 millions raw data.
After that I wanted to upload this aggregated data with ops for each group to ES. Here is my schema for aggregated data:
UserGroup, Platform, Category, SubCategory, OPS
But the problem in this case is that my OPS are calculated only for grouped by 4 fields records, for instance, if my document is ES looks like this:
A, Mobile, Book, KidsBook, 4356
B, Mobile, Book, Fantasy, 1234
B, Desktop, Book, KidsBook, 2345
A, Desktop, Clothes, Jeans, 2345
A, Mobile, Cosmetics, Lipstic, 656
, I cannot know from this data , what is the ops only for userGroup A with mobile platform. Or only for Books, or only for Books within Mobile platform.
I need help or your thought on this problem, I am trying to solve.