Get the aggregation that measures unique users from different cities using elasticsearch event log index

Nisrein_Sada · May 12, 2020, 1:39pm

I am trying to measure the number of users from different countries by reading the events_log_index ES, this index has all the user interactions from a mobile. Therefore the user might have several records in the event index.

some of the fields in the mapping include:

sid: session_id
uid: user_id
location: nested object containing the users country and city

    {
    "properties": {
        "city": {
            "type": "keyword"
        },
        "country_long": {
            "type": "keyword"
        },
        "country_short": {
            "type": "keyword"
        },
        "ip": {
            "type": "keyword"
        },
        "loc": {
            "type": "geo_point"
        },
        "region": {
            "type": "keyword"
        }
    }
}

I want to know the number of unique users from different countries and then different cities in these countries, so to get the number of users from different countries I used the following query:

   {
    "size": 0,
    "query": {
            "match_all":{}
        },
    "aggs" : {
        "users_per_country" : {
            "terms" : {
                  "field": "location.country_long",
			        "size": 2,
			        "order": {
			          "_count": "desc"
			        }
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid"
                    }
                }
            }
        }
    }
}

The results were as follows:

{
    "took": 3702,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "users_per_country": {
            "doc_count_error_upper_bound": 14782,
            "sum_other_doc_count": 847455,
            "buckets": [
                {
                    "key": "Egypt",
                    "doc_count": 35337333,
                    "distinct_users": {
                        "value": 434430
                    }
                },
                {
                    "key": "Jordan",
                    "doc_count": 4649252,
                    "distinct_users": {
                        "value": 65504
                    }
                }
            ]
        }
    }
}

So then I wanted to check for each city in one of these countries the distinct number of users, so I used:

{
    "size": 0,
    "query": {
            "term": {
                "location.country_long": "Egypt"
            }
        },
    "aggs" : {
        "users_per_country" : {
            "terms" : {
                  "field": "location.city",
			        "size": 5,
			        "order": {
			          "_count": "desc"
			        }
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid"
                    }
                }
            }
        }
    }
}

the result was as follows:

{
    "took": 3034,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "users_per_country": {
            "doc_count_error_upper_bound": 236846,
            "sum_other_doc_count": 10455532,
            "buckets": [
                {
                    "key": "Al Jizah",
                    "doc_count": 8552318,
                    "distinct_users": {
                        "value": 222287
                    }
                },
                {
                    "key": "Cairo",
                    "doc_count": 6923425,
                    "distinct_users": {
                        "value": 188695
                    }
                },
                {
                    "key": "Al Khankah",
                    "doc_count": 4539083,
                    "distinct_users": {
                        "value": 151461
                    }
                },
                {
                    "key": "Alexandria",
                    "doc_count": 3901927,
                    "distinct_users": {
                        "value": 125361
                    }
                },
                {
                    "key": "Al Mansurah",
                    "doc_count": 965048,
                    "distinct_users": {
                        "value": 36865
                    }
                }
            ]
        }
    }
}

The main issue is that comparing the total number of users from Egypt to the sum of users in each bucket from the city does not add up, so you can please advise how to construct this query to get the desired result?

Hendrik_Muhs · May 13, 2020, 5:49am

The cardinality aggregation returns an approximate count in order to scale. You can tweak the precision, however it will remain approximate.

To get exact counts you can use a composite aggregation and group by user id.

Another alternative, which is based on composite aggregations, is an entity centric index. In a nutshell you create a 2nd index build around your users, so you turn single events into sessions grouped by user id. This will give you a entity focused view on users, which is useful to answer all kinds of questions. The feature is called transform, I suggest this link to get an idea.

(An even easier start might be, to watch this recording)

Nisrein_Sada · May 13, 2020, 8:31am

Thank Hendrik_Muhs for your reply

Can you please explain how composite aggregations will help me in getting the unique number of users per city because what I understand that it is equivalent to group by 'city' and 'userId' which is not the intended purpose? can you please provide me with an example ?

Thank u for the link for the transform feature i was not aware and i will try it?

Hendrik_Muhs · May 13, 2020, 10:14am

Yes, I was thinking of grouping by uid and city to replace the cardinality aggregation which by design returns an approximate count. However, with that I see, you loose the ability to get a count by city only. Maybe tweaking the precision_threshold is the better option for you. Note that even with the default the precision error is quite low. Is it a hard requirement to get exact numbers?

Nisrein_Sada · May 14, 2020, 7:07am

Thank you hendrik
it is not a hard requirement to get an exact number and i tried to tune the precision thershold as have suggested. But the issue that i face with this approach is the discrepancy between the unique number of users in cities in Country compared to the total in the country.
so for example as shown in the result had 434430 users and 2 cities 222287 and 188695(the total number of users from cities is larger than the country), so i wanted to understand how this can happen?

as for the result, we had a user_index that contains the information from registered users and I used terms without cardinality and the results make sense

thanks again

Hendrik_Muhs · May 14, 2020, 7:45am

I think I got your question now. Sorry, I misunderstood in the beginning.

How many shards does the index have you query?

I wonder if the problem might be that the top-n is different for every shard. When you do the aggregation and limit the size to 2. The aggregation will ask every shard for results and combine it. The number of buckets it asks for, for every shard is called shard_size. It's not 2 as explained here, but it might be worth a try to increase the parameter and see what happens.

Nisrein_Sada · May 14, 2020, 10:23am

Thank you again Hendrik
this index has 5 shards and i tried tweaking the size parameter so i think i might have data extraction issue in the index

system · June 11, 2020, 10:23am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch aggregation not matching with unique count metrics Elasticsearch	4	1648	December 29, 2017
Beginner - How to aggregate count on large cardinality? Elasticsearch	5	426	April 1, 2019
Cardinality computation Elasticsearch	13	1981	July 5, 2017
Get number of unique values in a field Elasticsearch	3	1041	July 6, 2017
Aggregating the distinct result to a grouping Elasticsearch	3	992	July 5, 2017

Get the aggregation that measures unique users from different cities using elasticsearch event log index

Related topics