Time filtered aggregations with very high cardinality

Metheny · December 27, 2020, 6:49am

We have an index with a large amount of user reports (millions / 10s of millions). Assuming most users have reports, we need to calculate some statistics per-user with an input time range.
For example, how many users have an average of between 10-15, 15-20, 20-30 reports per week in a specific time interval.
Please note that we don't need to return the buckets themselves in the response, but they need to be evaluated by the sub aggregations that calculate the average & ranges. To my understanding, elasticsearch has a limit to the number of buckets created, and it's not recommended to increase it to millions (plus our ES is hosted in AWS which doesn't allow changing this parameter)
I've read about the composite aggregation and the partitioning in the terms aggregation used for pagination, but it won't help since the client might need to make thousands of calls to elasticsearch and this might take a very long time (hours).
Also read about entity-centric indices, but it seems that since our query includes a dynamic time filter we won't be able to prepare the data in advance since we don't know which time range will be queried.

Below is a simplified version of our current query. We want to calculate the number of users that have between X1-X2 monthly reports between 2 dates.

bucket the reports by user id.
use bucket selector to select only the users which have between Y-Z reports (Y1 & Y2 are pre-calculated by the client - these are the numbers which will resolve to an average of between X1-X2 monthly reports).
count the number of buckets left

The problem is that the original bucketing (terms aggregation) will only return a relatively small amount of buckets (not millions), so only a small amount of users will be evaluated at all.
What would be the best way to achieve this?

    POST /reports/_search
    {
         "size": 0,
         "query": {
             "range": {
                 "timestamp": {
                     "gte": "2020-01-01T00:00:00.000Z",
                     "lte": "2020-12-24T23:59:59.999Z",
                     "format": "strict_date_optional_time"
                 }
             }
         },
         "aggs": {
             "distinctIds_less_than_monthly": {
                 "terms": {
                     "field": "userId" // this will only return a small amount of buckets
                     "size": 10000, // this is still very small relative to the number of users
                 },
                 "aggs": {
                     "less_than_monthly": {
                         "bucket_selector": {
                             "buckets_path": {
                                 "distinctUsers": "distinctUsers_less_than_monthly.value"
                             },
                             "script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
                         }
                     },
                     "distinctUsers_less_than_monthly": {
                         "value_count": {
                             "field": "userId"
                         }
                     }                 
                 }
             },
             "userCount_less_than_monthly": {
                 "stats_bucket": {
                     "buckets_path": "distinctIds_less_than_monthly._count"
                 }
             }
         }
    }

system · January 24, 2021, 6:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Easy way to get a terms bucket count stats? Elasticsearch	6	1119	January 3, 2020
How to create buckets for number of occurances Elasticsearch	1	326	October 21, 2019
Filtering on aggregated values Elasticsearch	3	349	January 9, 2019
Elasticsearch aggregations for analytics Elasticsearch	1	295	July 6, 2017
Too many aggregation buckets Elasticsearch	3	3079	July 5, 2017

Time filtered aggregations with very high cardinality

Related topics