Requirement: calculate a sum aggregate on a high-cardinality field of 2M unique values, order the results by the sum, and return the top 10 (or pages of 10 each).
We are moving some of our materialized-view based data summarization queries from Postgres to Elasticsearch, and came across a need to aggregate a sum of on a high-cardinality field; basically think email addresses. We have around 2 million unique values for this field.
In our version 7.1 cluster, we're hit by the default limit of 10000 buckets based on the cluster setting for
search.max_bucket. We could of course update this to something higher, but getting to 2 million would probably put us in danger of bringing down the cluster. There is the possibility of many concurrent queries too.
We cannot do a
composite aggregation and paginate the buckets here because our requirement is to return the results ordered by sum - highest values first. So we have a
bucket_sort pipeline aggregation alongside the
composite as a means to do this, but hit a limit when the
size value in the
composite is over 10k.
I've seen some posts that generally say: "you can't do this in Elasticsearch. Instead pre-aggregate the data and dump it into a new entity-centric index".
Is that really the only approach when you come up against trying to post-process (e.g. sort) aggregated data that has greater than 10k (or some reasonable limit) of unique buckets?