We have an index with a large amount of user reports (millions / 10s of millions). Assuming most users have reports, we need to calculate some statistics per-user with an input time range.
For example, how many users have an average of between 10-15, 15-20, 20-30 reports per week in a specific time interval.
Please note that we don't need to return the buckets themselves in the response, but they need to be evaluated by the sub aggregations that calculate the average & ranges. To my understanding, elasticsearch has a limit to the number of buckets created, and it's not recommended to increase it to millions (plus our ES is hosted in AWS which doesn't allow changing this parameter)
I've read about the composite aggregation and the partitioning in the terms aggregation used for pagination, but it won't help since the client might need to make thousands of calls to elasticsearch and this might take a very long time (hours).
Also read about entity-centric indices, but it seems that since our query includes a dynamic time filter we won't be able to prepare the data in advance since we don't know which time range will be queried.
Below is a simplified version of our current query. We want to calculate the number of users that have between X1-X2 monthly reports between 2 dates.
- bucket the reports by user id.
- use bucket selector to select only the users which have between Y-Z reports (Y1 & Y2 are pre-calculated by the client - these are the numbers which will resolve to an average of between X1-X2 monthly reports).
- count the number of buckets left
The problem is that the original bucketing (terms aggregation) will only return a relatively small amount of buckets (not millions), so only a small amount of users will be evaluated at all.
What would be the best way to achieve this?
POST /reports/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2020-01-01T00:00:00.000Z",
"lte": "2020-12-24T23:59:59.999Z",
"format": "strict_date_optional_time"
}
}
},
"aggs": {
"distinctIds_less_than_monthly": {
"terms": {
"field": "userId" // this will only return a small amount of buckets
"size": 10000, // this is still very small relative to the number of users
},
"aggs": {
"less_than_monthly": {
"bucket_selector": {
"buckets_path": {
"distinctUsers": "distinctUsers_less_than_monthly.value"
},
"script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
}
},
"distinctUsers_less_than_monthly": {
"value_count": {
"field": "userId"
}
}
}
},
"userCount_less_than_monthly": {
"stats_bucket": {
"buckets_path": "distinctIds_less_than_monthly._count"
}
}
}
}