Aggregating the distinct result to a grouping

Hi all,
I am very new to Elasticsearch and could not find any related discussion on this.
I am currently using Elasticsearch to perform analytics on our data, which I migrated from an RDBMS into Elasticsearch via Logstash.

I have example data in the following format in the index userLogs, type logs, with _id set to the id from the RDBMS:

Index : userLogs/logs/{id}
[
  { action : 'click', page : 'product_page', date : '2016-02-02', userId : 'userId1', user : { name : 'Jason', age : 20, country : 'Singapore', ethnicity : 'Chinese' } },
  { action : 'click', date : '2016-02-03', userId : 'userId2', user : { name : 'James', age : 23, country : 'Australia', ethnicity : 'Indian' } },
  { action : 'click', date : '2016-02-02', userId : 'userId1', user : { name : 'Jason', age : 20, country : 'Singapore', ethnicity : 'Chinese' } }
]

Because this is a log of user activity, the data above contains duplicates: in a single day there can be multiple records for the same user, as long as that user is active.
Assuming the data has accumulated for some time, user demographic research needs to be performed across it. Let's say I want to find out how many distinct users were active in the month of February, and then collect the demographic profile of those distinct users. For example, we might have 5,000 activity records for that month but only 500 distinct users among them, and I want the demographics of those 500 users. I know that if I run a terms aggregation under a filter, I get results counted over the 5,000 records instead of the 500 users, so the result is not distinct.
In short, I need to reduce the 5,000 records to the 500 distinct users in one bucket, each with their user data, and then aggregate over those 500 users, counting only their demographic attributes such as age, country and ethnicity.

In an RDBMS, I could use a CTE (common table expression) to collect the distinct user rows into one table and then aggregate over it, for example: select age, ethnicity, country, count(*) from CTE_distinct_users group by age, ethnicity, country
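
To make the RDBMS approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name user_logs, its flattened columns, and the February date range are assumptions made for illustration; only the three sample records and the final select come from the post above.

```python
import sqlite3

# Hypothetical table and column names, modelled on the sample log records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_logs (action TEXT, date TEXT, user_id TEXT, "
    "age INTEGER, country TEXT, ethnicity TEXT)"
)
conn.executemany(
    "INSERT INTO user_logs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("click", "2016-02-02", "userId1", 20, "Singapore", "Chinese"),
        ("click", "2016-02-03", "userId2", 23, "Australia", "Indian"),
        ("click", "2016-02-02", "userId1", 20, "Singapore", "Chinese"),
    ],
)

# Step 1 (the CTE): collapse the February log rows to one row per user.
# Step 2: count users per demographic combination.
query = """
WITH cte_distinct_users AS (
    SELECT DISTINCT user_id, age, ethnicity, country
    FROM user_logs
    WHERE date >= '2016-02-01' AND date < '2016-03-01'
)
SELECT age, ethnicity, country, COUNT(*) AS user_count
FROM cte_distinct_users
GROUP BY age, ethnicity, country
ORDER BY age
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the three sample records, the two rows for userId1 collapse into one, so each demographic combination is counted once per user rather than once per log entry.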

Is there any way to achieve this result in Elasticsearch? Please help :)

It's early and I haven't had coffee, but I think what you are asking is how to get a list of the distinct users (and their profile information) given a certain set of criteria about the users' activity (e.g. users who visited in February).

Based on the event model you've presented, I would suggest the following: include a query for the limiting criteria (activity in February), set size to 0, add a terms aggregation on userId, and, inside that terms aggregation, include a "top hits" aggregation with a size of 1. Your result will then contain one hit per distinct userId, and it will just be a matter of extracting them from the JSON response.
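
The request described above could look roughly like the following, written as a Python dict so it can be passed to whichever client you use. This is only a sketch: the aggregation names distinct_users and sample_event, the helper function names, and the max_users limit are made up for illustration, and it assumes userId is indexed as an exact (not analyzed) value and date is mapped as a date field.

```python
def build_distinct_users_request(start, end, max_users=1000):
    """Build a search body: filter by date, one bucket per userId, one hit each."""
    return {
        "size": 0,  # no top-level hits, only aggregation results
        "query": {"range": {"date": {"gte": start, "lt": end}}},
        "aggs": {
            "distinct_users": {  # hypothetical aggregation name
                # The terms aggregation only returns up to `size` buckets,
                # so max_users must be at least the number of distinct users.
                "terms": {"field": "userId", "size": max_users},
                "aggs": {
                    "sample_event": {"top_hits": {"size": 1}}  # hypothetical name
                },
            }
        },
    }


def extract_distinct_users(response):
    """Pull the single top hit's _source out of every userId bucket."""
    buckets = response["aggregations"]["distinct_users"]["buckets"]
    return [b["sample_event"]["hits"]["hits"][0]["_source"] for b in buckets]


body = build_distinct_users_request("2016-02-01", "2016-03-01")
```

The body would be sent to the search endpoint of the index, and extract_distinct_users would then be applied to the parsed JSON response to get one document per user.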

Personally, I would further suggest that you not include the entire user profile information in each of your events. But I also wouldn't suggest using a parent/child relationship between the events and the related users, as that approach will incur significant resource overhead in what I infer is a system which will have a really large number of events. Instead I would suggest resolving the user profile information as-needed from the application server using the userId.

I hope that's close to what you are looking for.

Thanks for that.
However, top_hits gives me each distinct user, but let's say I want to group by the users' profile attributes. What I mean is: take the results for the distinct users and group them by age, country and ethnicity to get the counts. For example, for 500 distinct users, the ethnicity profile would be [Chinese (200), Indian (100), Others (200)].

Then break each ethnicity down further, as follows:
Chinese (200)
-> Age 18 (10)
-> Age 23 (30)
-> Age 26 (50)
-> Age 28 (20)

Is that possible? :)
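
To make the requested breakdown concrete, here is a minimal client-side sketch in Python. It assumes the distinct user profiles have already been collected into a list (for instance from the top_hits results); the function name demographic_breakdown is hypothetical, and this only illustrates the desired output shape, not an Elasticsearch feature.

```python
from collections import Counter, defaultdict


def demographic_breakdown(users):
    """Count distinct users per ethnicity, then per age within each ethnicity."""
    by_ethnicity = Counter()
    ages_within = defaultdict(Counter)
    for user in users:
        by_ethnicity[user["ethnicity"]] += 1
        ages_within[user["ethnicity"]][user["age"]] += 1
    return by_ethnicity, ages_within


# The two distinct users from the sample data, one profile per user.
users = [
    {"name": "Jason", "age": 20, "country": "Singapore", "ethnicity": "Chinese"},
    {"name": "James", "age": 23, "country": "Australia", "ethnicity": "Indian"},
]

by_ethnicity, ages_within = demographic_breakdown(users)
for ethnicity, count in by_ethnicity.items():
    print(f"{ethnicity} ({count})")
    for age, age_count in sorted(ages_within[ethnicity].items()):
        print(f"-> Age {age} ({age_count})")
```

Doing this on the client means every distinct user profile has to be transferred first, which may be acceptable for 500 users but is worth reconsidering for much larger sets.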
