How are the documents selected by the Sampler Aggregation?


We are looking at using the Sampler Aggregation to limit the run-time of aggregations on large data sets.

What we could not find in the documentation is how the documents are selected. Are they sorted in some way before being sampled, or is the order of documents, and thus the actual set of documents used for the aggregation, deterministic in any way?

We would like to ensure that the samples are selected in a good random order so that things like a TopX aggregation still returns useful information with high probability.

Thanks... Dominik.

For the sampler aggregation the documents selected for the sample are the N top scoring documents (where score is defined by the query) from each shard (where N is the sample size). The diversified_sampler aggregation also selects the N top scoring documents but limits the sample to only X documents for each value of the field you select for diversification.
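To make the selection rule concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Elasticsearch's actual implementation: the shard contents, the `score` values, and the `author` field are made up for illustration, while `shard_size` and `max_docs_per_value` mirror the aggregation parameter names.

```python
import heapq
from collections import Counter


def sampler(shards, shard_size):
    """Toy model: keep the shard_size best-scoring docs from each shard."""
    sample = []
    for docs in shards:
        # nlargest returns the top docs of this shard, best score first.
        sample.extend(heapq.nlargest(shard_size, docs, key=lambda d: d["score"]))
    return sample


def diversified_sampler(shards, shard_size, field, max_docs_per_value=1):
    """Toy model: same as sampler(), but cap the docs kept per field value."""
    sample = []
    for docs in shards:
        seen = Counter()  # how many docs were kept per value, on this shard
        kept = []
        for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
            if len(kept) == shard_size:
                break
            if seen[doc[field]] < max_docs_per_value:
                seen[doc[field]] += 1
                kept.append(doc)
        sample.extend(kept)
    return sample


if __name__ == "__main__":
    # Hypothetical data: two shards, scores as assigned by the query.
    shards = [
        [
            {"id": "a", "score": 3.0, "author": "x"},
            {"id": "b", "score": 2.0, "author": "x"},
            {"id": "c", "score": 1.0, "author": "y"},
        ],
        [
            {"id": "d", "score": 0.5, "author": "z"},
            {"id": "e", "score": 5.0, "author": "z"},
        ],
    ]
    print([d["id"] for d in sampler(shards, 2)])
    print([d["id"] for d in diversified_sampler(shards, 2, "author")])
```

Note that if the query gives every document the same score, this model says nothing about which documents end up in the sample, which is why the scoring function matters for getting a useful sample.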

Hope that helps

Thanks for the explanation, does that mean I could use something like to ensure a pseudo-random selection based on the seed that I pass in?

I.e. with the same seed I get the same set sampled and with pure-random seed I get a good random distribution of sampled documents?
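For illustration, a request along these lines might look like the sketch below. It assumes the seeded approach is a function_score query with random_score, which is my reading and not something confirmed in this thread. The helper name, the aggregation names `sample` and `top_values`, and the `category` field are hypothetical. Newer Elasticsearch versions expect a `field` next to `seed`, so `_seq_no` is included here as an assumption; older versions may only accept `seed`.

```python
import json


def build_random_sample_request(seed, shard_size, terms_field):
    """Build a search body that scores docs pseudo-randomly, then samples.

    The sampler keeps the top-scoring docs per shard, so a seeded random
    score should make the sample pseudo-random but repeatable per seed.
    """
    return {
        "size": 0,  # only the aggregation result is of interest
        "query": {
            "function_score": {
                # Assumption: "field" is needed alongside "seed" on newer versions.
                "random_score": {"seed": seed, "field": "_seq_no"},
            }
        },
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    # Hypothetical sub-aggregation run on the sample only.
                    "top_values": {"terms": {"field": terms_field}}
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_random_sample_request(seed=42, shard_size=200, terms_field="category")
    print(json.dumps(body, indent=2))
```

Passing the same seed produces the same request body; whether the resulting sample is identical across runs also depends on the index not changing in between.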

Yes, you should be able to do that, although I haven't tested it myself.

I quickly tested it and it seems to work. However, it has quite a performance impact, which cancels out the benefit we were hoping to get from the Sampler Aggregation in the first place: the query then takes at least as long as a full aggregation without sampling :frowning: