How are the documents selected by the Sampler Aggregation

Dominik_Stadler · April 5, 2016, 3:22pm

Hi,

We are looking at using the sampler aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-bucket-sampler-aggregation.html) to limit the run time of aggregations on large data sets.

What we could not find in the documentation is how the documents are selected. Are they sorted in some way before being sampled, or is the order of documents, and thus the set of documents actually used for the aggregation, deterministic in any way?

We would like to ensure that the samples are selected in a good random order, so that something like a TopX aggregation still returns useful information with high probability.

Thanks... Dominik.

colings86 · April 6, 2016, 7:23am

For the sampler aggregation the documents selected for the sample are the N top scoring documents (where score is defined by the query) from each shard (where N is the sample size). The diversified_sampler aggregation also selects the N top scoring documents but limits the sample to only X documents for each value of the field you select for diversification.
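
A minimal sketch of a request body that uses the sampler this way, written as a Python dict so it can be passed to any HTTP client. The `shard_size` parameter is the per-shard sample size N described above; the field names `message` and `host`, the aggregation names, and the helper `sampler_request` are hypothetical and only there for illustration.

```python
import json


def sampler_request(query, shard_size, terms_field, top_n=10):
    """Build a search body whose terms aggregation runs only on the sample.

    The sampler keeps the `shard_size` top-scoring documents per shard,
    where the score comes from `query`.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": query,
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_values": {
                        "terms": {"field": terms_field, "size": top_n}
                    }
                },
            }
        },
    }


# Hypothetical query and field names.
body = sampler_request({"match": {"message": "error"}}, 200, "host")
print(json.dumps(body, indent=2))
```

Because the selection is driven by score, a query that gives every document the same score (for example a plain filter) gives the sampler nothing meaningful to rank by.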

Hope that helps

Dominik_Stadler · April 6, 2016, 10:59am

Thanks for the explanation. Does that mean I could use something like https://www.elastic.co/guide/en/elasticsearch/guide/current/random-scoring.html to ensure a pseudo-random selection based on the seed that I pass in?

I.e., with the same seed I get the same set of documents sampled, and with a purely random seed I get a good random distribution of sampled documents?
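
The idea could be sketched roughly as follows: wrap the query in a `function_score` query with `random_score`, so that the sampler's "top scoring" documents are effectively a seeded pseudo-random pick. This is an untested sketch, assuming the `function_score` syntax from the linked guide; the field name `host`, the aggregation names, and the helper `random_sample_request` are hypothetical.

```python
import json


def random_sample_request(seed, shard_size, terms_field, top_n=10):
    """Build a search body that samples by seeded random score."""
    return {
        "size": 0,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                # The seed decides the pseudo-random score of each document.
                "random_score": {"seed": seed},
                # Use only the random value as the score.
                "boost_mode": "replace",
            }
        },
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_values": {
                        "terms": {"field": terms_field, "size": top_n}
                    }
                },
            }
        },
    }


# Two requests built with the same seed are identical.
first = random_sample_request(seed=42, shard_size=200, terms_field="host")
second = random_sample_request(seed=42, shard_size=200, terms_field="host")
print(json.dumps(first, indent=2))
```

Note that a random score has to be computed for every document matching the query before the top N per shard can be chosen.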

colings86 · April 6, 2016, 12:12pm

Yes, you should be able to do that, although I haven't tested it myself.

Dominik_Stadler · April 6, 2016, 12:24pm

I quickly tested it and it seems to work. However, it has quite a performance impact, which negates the benefit we were trying to get from the sampler aggregation in the first place: it then takes at least as long as running the full aggregation without sampling.
