Get an aggregation that counts unique users per country and city from an Elasticsearch event log index

I am trying to measure the number of users from different countries by reading the events_log_index in Elasticsearch. This index stores every user interaction from the mobile app, so a single user can have many records in it.

Some of the fields in the mapping include:

  • sid: session_id
  • uid: user_id
  • location: nested object containing the user's country and city, mapped as follows:
    {
        "properties": {
            "city": { "type": "keyword" },
            "country_long": { "type": "keyword" },
            "country_short": { "type": "keyword" },
            "ip": { "type": "keyword" },
            "loc": { "type": "geo_point" },
            "region": { "type": "keyword" }
        }
    }

I want to know the number of unique users per country, and then per city within each country. To get the number of unique users per country, I used the following query:

{
    "size": 0,
    "query": {
        "match_all": {}
    },
    "aggs": {
        "users_per_country": {
            "terms": {
                "field": "location.country_long",
                "size": 2,
                "order": {
                    "_count": "desc"
                }
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid"
                    }
                }
            }
        }
    }
}

The results were as follows:

{
    "took": 3702,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "users_per_country": {
            "doc_count_error_upper_bound": 14782,
            "sum_other_doc_count": 847455,
            "buckets": [
                {
                    "key": "Egypt",
                    "doc_count": 35337333,
                    "distinct_users": {
                        "value": 434430
                    }
                },
                {
                    "key": "Jordan",
                    "doc_count": 4649252,
                    "distinct_users": {
                        "value": 65504
                    }
                }
            ]
        }
    }
}

Then, to check the number of distinct users per city within one of these countries, I used:

{
    "size": 0,
    "query": {
        "term": {
            "location.country_long": "Egypt"
        }
    },
    "aggs": {
        "users_per_country": {
            "terms": {
                "field": "location.city",
                "size": 5,
                "order": {
                    "_count": "desc"
                }
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid"
                    }
                }
            }
        }
    }
}

The result was as follows:

{
    "took": 3034,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "users_per_country": {
            "doc_count_error_upper_bound": 236846,
            "sum_other_doc_count": 10455532,
            "buckets": [
                {
                    "key": "Al Jizah",
                    "doc_count": 8552318,
                    "distinct_users": {
                        "value": 222287
                    }
                },
                {
                    "key": "Cairo",
                    "doc_count": 6923425,
                    "distinct_users": {
                        "value": 188695
                    }
                },
                {
                    "key": "Al Khankah",
                    "doc_count": 4539083,
                    "distinct_users": {
                        "value": 151461
                    }
                },
                {
                    "key": "Alexandria",
                    "doc_count": 3901927,
                    "distinct_users": {
                        "value": 125361
                    }
                },
                {
                    "key": "Al Mansurah",
                    "doc_count": 965048,
                    "distinct_users": {
                        "value": 36865
                    }
                }
            ]
        }
    }
}

The main issue is that the total number of unique users for Egypt does not match the sum of the unique users in its city buckets. Can you please advise how to construct this query to get the desired result?

The cardinality aggregation returns an approximate count in order to scale. You can tweak the precision via precision_threshold, but it will remain approximate.
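For illustration, here is a sketch of your first query with precision_threshold raised to 40000, which is the maximum supported value (counts beyond that threshold are still approximate):

{
    "size": 0,
    "aggs": {
        "users_per_country": {
            "terms": {
                "field": "location.country_long",
                "size": 2
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid",
                        "precision_threshold": 40000
                    }
                }
            }
        }
    }
}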

To get exact counts you can use a composite aggregation and group by user id.
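As a rough sketch, using the field name from your mapping, a composite aggregation with one bucket per uid would look like this:

{
    "size": 0,
    "aggs": {
        "distinct_users": {
            "composite": {
                "size": 1000,
                "sources": [
                    { "uid": { "terms": { "field": "uid" } } }
                ]
            }
        }
    }
}

Each response contains an after_key; you pass it back as "after" in the next request and keep counting buckets until none are returned. The total number of buckets is the exact number of distinct users.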

Another alternative, which is based on composite aggregations, is an entity-centric index. In a nutshell, you create a second index built around your users, turning single events into sessions grouped by user id. This gives you an entity-focused view of your users, which is useful for answering all kinds of questions. The feature is called transform; I suggest this link to get an idea.

(An even easier start might be to watch this recording.)
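To give an idea of the shape, a minimal transform sketch could look roughly like this, assuming a recent 7.x cluster where the endpoint is _transform (the transform id, destination index name, and the chosen aggregation are only placeholders; adjust the group_by and aggregations to your needs):

PUT _transform/users_by_city
{
    "source": { "index": "events_log_index" },
    "dest": { "index": "users_by_city" },
    "pivot": {
        "group_by": {
            "uid": { "terms": { "field": "uid" } },
            "city": { "terms": { "field": "location.city" } },
            "country": { "terms": { "field": "location.country_long" } }
        },
        "aggregations": {
            "event_count": { "value_count": { "field": "sid" } }
        }
    }
}

After starting the transform, each document in the destination index represents one user in one city, so a plain terms aggregation on city in that index gives exact user counts.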


Thank you, Hendrik_Muhs, for your reply.

Can you please explain how composite aggregations will help me get the unique number of users per city? From what I understand, it is equivalent to grouping by 'city' and 'userId', which is not the intended purpose. Can you please provide an example?

Thank you for the link to the transform feature; I was not aware of it and I will try it.

Yes, I was thinking of grouping by uid and city to replace the cardinality aggregation, which by design returns an approximate count. However, with that I see you lose the ability to get a count by city only. Maybe tweaking the precision_threshold is the better option for you. Note that even with the default the precision error is quite low. Is it a hard requirement to get exact numbers?
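For reference, the grouping I had in mind looks roughly like this (a sketch assuming your field names). Each bucket is one city/uid pair, so the per-city user count has to be derived by paging through and counting buckets on the client side, which is why the simple per-city count is lost:

{
    "size": 0,
    "query": {
        "term": { "location.country_long": "Egypt" }
    },
    "aggs": {
        "city_user_pairs": {
            "composite": {
                "size": 1000,
                "sources": [
                    { "city": { "terms": { "field": "location.city" } } },
                    { "uid": { "terms": { "field": "uid" } } }
                ]
            }
        }
    }
}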

Thank you, Hendrik.
It is not a hard requirement to get an exact number, and I tried to tune the precision_threshold as you suggested. The issue I face with this approach is the discrepancy between the number of unique users across the cities of a country and the total for the country itself.
For example, as shown in the results, the country had 434430 users while two of its cities had 222287 and 188695 (the total number of users across the city buckets is larger than the country total), so I wanted to understand how this can happen.

As for the results, we also have a user_index that contains the information of registered users; there I used terms without cardinality and the results make sense.

Thanks again.

I think I got your question now. Sorry, I misunderstood in the beginning.

How many shards does the index you query have?

I wonder if the problem might be that the top-n buckets are different for every shard. When you run the aggregation with size limited to 2, it asks every shard for results and combines them. The number of buckets it requests from every shard is called shard_size. It is not simply 2, as explained here, but it might be worth increasing the parameter and seeing what happens.
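For example, here is a sketch of your original country query with an explicit shard_size (the value 100 is arbitrary; by default shard_size is derived from size):

{
    "size": 0,
    "aggs": {
        "users_per_country": {
            "terms": {
                "field": "location.country_long",
                "size": 2,
                "shard_size": 100
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "uid"
                    }
                }
            }
        }
    }
}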

Thank you again, Hendrik.
This index has 5 shards, and I tried tweaking the size parameter, so I think I might have a data extraction issue in the index.
