Multiple Dimensions Analysis on aggregated dataset in Kibana

Hello, I have a question, on how to aggregate data for ElasticSearch and display it in Kibana correctly.
Here is my use case:
I have following structure of raw SQL table Schema:

UserId, UserGroup, Platform, ProductAsin, Category, SubCategory, Units, Price

UserId - user, who bought product(ProductAsin)
UserGroup - user can be from group A or group B
Platform - Mobile or Desktop
ProductAsin - asin of product, that this userId has bought
Category - category of product (Books, Cosmetics etc, we have 100 categories)
SubCategory - sub category of product (~2000 subcategories total)
Units - how many of ProductAsin user has bought
Price - price of ONE unit

I have more than 5 millions records per day.

I need to calculate different metrics. Right now, to make it simple, lets discuss just one of them, the most important called Average Order Product Sales ( avg ops). ops is calculate like number of units user bought multiple by price of one unit. So basically, it’s how much user has spent.

What I need to do is to calculate this metric (avg ops) and display it in Kibana, so users of Kibana can visualize this metric (avg ops) in different dimensions.
For example, what is avg ops in Books Category?
Or what’s the difference between avg ops in User Group A who used Mobile Platform and User Group A who used Desktop platform.
Or what is avg ops in Mobile Platform and Kids Books category for UserGroup B.

So basically, I need to have this one metrics (avg ops) to be available for all possible dimensions.

As far as I have large dataset (5 mln records per day), I think, I cannot upload all these records into ES everyday.
What I decided to do is to process input dataset in Java and after that, upload to ES.
How I processed it: I performed group by by all dimensions that I care about (UserGroup, Platform, Category, SubCategory).
So I got multiple(4) fields group by clause. And calculated total ops for each of this group. I got 5000 aggregated records out of 5 millions raw data.

After that I wanted to upload this aggregated data with ops for each group to ES. Here is my schema for aggregated data:
UserGroup, Platform, Category, SubCategory, OPS
But the problem in this case is that my OPS are calculated only for grouped by 4 fields records, for instance, if my document is ES looks like this:
A, Mobile, Book, KidsBook, 4356
B, Mobile, Book, Fantasy, 1234
B, Desktop, Book, KidsBook, 2345
A, Desktop, Clothes, Jeans, 2345
A, Mobile, Cosmetics, Lipstic, 656

, I cannot know from this data , what is the ops only for userGroup A with mobile platform. Or only for Books, or only for Books within Mobile platform.

I need help or your thought on this problem, I am trying to solve.



first of all, what you are describing is an ideal use-case for Elasticsearch/Kibana. If you would index the original documents (instead of the aggregated results) it would even be way more flexible in the end to decide what you visualize, but I understand, that this might be not feasible for you, e.g. due to storage limitations. But Elasticsearch itself wouldn't have any problems if you would give it just 5 million documents a day and try to aggregate on them.

To work with your aggregated data, let's just create a simple metric that will just show the OPS for all books bought via mobile.

You would just create a metrics visualization in Kibana, and add filters on the top of the screen for "Platform is mobile" and "Category is book".

As the Metrics visualization you would select Sum over the "OPS" field, to get the OPS of all the different documents summed up, to calculate the overall OPS. You could also use an Average if you would e.g. be interested in the average OPS of all subcategories (since you have one document per subcategory) within Books sold via Mobile.

You could of course apply the same way filters to just get the OPS of User Group A on Mobile.

Hope that could help getting your started. Please feel free to reach out with any further questions.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.