Aggregation query size?

Hello Elasticsearch experts,

I have about 100M documents and want to run an aggregation query over the features/terms of a random subset (500k) of those documents. In the query to Elasticsearch, I can pass the 500k random documents as a filter. Is there a limit (or a limit we can configure) on how big the subset can be for one aggregation query?

Thanks in advance,
Lin

There's no technical limit to aggregation size, but you may run into practical limitations due to memory (depending on how you structure your aggregation, and whether you are using fielddata or doc values).

As for limiting the size, that is generally done through various mechanisms that restrict the "scope" the aggregation runs on. For example, you can filter the search to reduce the number of documents, limit the number of terms that a terms agg will calculate, or use the Sampler aggregation to sub-sample a context.
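
To make those three mechanisms concrete, here is a minimal sketch of a search body that combines them, written as a plain Python dict so it can be inspected without a cluster. The field name `features`, the document IDs, and the helper name `build_scoped_terms_agg` are assumptions for illustration only; the `shard_size` and `size` values are arbitrary.

```python
import json

def build_scoped_terms_agg(doc_ids, field, top_n=20, shard_size=1000):
    """Build a search body that narrows the aggregation scope three ways.

    Hypothetical helper: it only assembles a request body and does not
    talk to Elasticsearch.
    """
    return {
        "size": 0,  # return no hits, only aggregation results
        # 1) Filter the search so only the chosen documents are in scope.
        "query": {"bool": {"filter": [{"ids": {"values": list(doc_ids)}}]}},
        "aggs": {
            "sample": {
                # 2) Sub-sample the matching docs; shard_size is per shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    # 3) Cap how many terms the terms agg returns.
                    "top_terms": {"terms": {"field": field, "size": top_n}}
                },
            }
        },
    }

body = build_scoped_terms_agg(["doc-1", "doc-2", "doc-3"], "features")
print(json.dumps(body, indent=2))
```

The resulting dict could be sent as the body of a `_search` request. Whether a very long ID list is practical is a separate question that this sketch does not address.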

If you can share a few more details, I may be able to help more specifically.

@polyfractal, if I need to use the Sampler aggregation, is it applied to whole documents? Could I apply the Sampler aggregation to any subset of documents (with the subset supplied dynamically in the query)? Thanks.

Yep! The Sampler agg is just an aggregation itself, so you could embed it inside of a Filter or Filters Aggregation for example. It would then "sample" all the docs that match the filter (which is itself filtering all the docs that match the query).
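
As a rough sketch of that nesting, the body below puts a `sampler` under a `filter` aggregation, with a terms agg computed on the sampled docs. The `status` and `features` field names, the `match_all` query, and the helper name are assumptions made up for the example.

```python
import json

def sampler_inside_filter(filter_clause, field, shard_size=200):
    """Nest a sampler under a filter agg (hypothetical helper, body only)."""
    return {
        "size": 0,
        # The query decides which docs reach the aggregations at all.
        "query": {"match_all": {}},
        "aggs": {
            "subset": {
                # The filter agg narrows the query's matches further.
                "filter": filter_clause,
                "aggs": {
                    "sample": {
                        # Only docs that passed the filter are sampled.
                        "sampler": {"shard_size": shard_size},
                        "aggs": {"top_terms": {"terms": {"field": field}}},
                    }
                },
            }
        },
    }

filter_body = sampler_inside_filter({"term": {"status": "active"}}, "features")
print(json.dumps(filter_body, indent=2))
```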

Or you could embed it inside of a terms aggregation, which means it would "sample" documents inside of each bucket of terms.
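
A sketch of the per-bucket variant might look like the following. The `category` and `features` fields and the helper name are hypothetical. The parent terms agg is kept on `depth_first` collection, since to my knowledge the sampler cannot be nested under a terms agg that uses `breadth_first`.

```python
import json

def sampler_per_bucket(bucket_field, sampled_field, shard_size=100):
    """Sample docs inside each terms bucket (hypothetical helper, body only)."""
    return {
        "size": 0,
        "aggs": {
            "by_bucket": {
                "terms": {
                    "field": bucket_field,
                    # Assumption: stay on depth_first so the sampler can nest here.
                    "collect_mode": "depth_first",
                },
                "aggs": {
                    "sample": {
                        # A separate sample is taken within every bucket.
                        "sampler": {"shard_size": shard_size},
                        "aggs": {
                            "top_terms": {"terms": {"field": sampled_field}}
                        },
                    }
                },
            }
        },
    }

bucket_body = sampler_per_bucket("category", "features")
print(json.dumps(bucket_body, indent=2))
```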

Yes, it has access to the entire document. You specify the field you want to "sample". So you might filter/aggregate on title but sample on category for example.
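
One way this could look is sketched below, with the caveat that in more recent Elasticsearch versions the per-field option lives, as far as I know, on the separate `diversified_sampler` aggregation rather than on `sampler` itself. The `title`, `category`, and `features` fields, the search text, and the helper name are assumptions for illustration.

```python
import json

def query_title_sample_on_category(title_text, shard_size=200, max_per_value=3):
    """Match on title, diversify the sample by category (body only)."""
    return {
        "size": 0,
        # The query works on one field...
        "query": {"match": {"title": title_text}},
        "aggs": {
            "diverse_sample": {
                # ...while the sample is spread across values of another field.
                "diversified_sampler": {
                    "field": "category",
                    "shard_size": shard_size,
                    "max_docs_per_value": max_per_value,
                },
                "aggs": {"top_terms": {"terms": {"field": "features"}}},
            }
        },
    }

diverse_body = query_title_sample_on_category("aggregation")
print(json.dumps(diverse_body, indent=2))
```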

@polyfractal, thanks a lot! The Sampler aggregation seems powerful. :smile:

Have a good day.