What we could not find in the documentation is how the documents are selected. Are they sorted in some way before being sampled, or is the order of the documents, and thus the set of documents actually used for the aggregation, deterministic in some way?
We would like to ensure that the samples are selected in a sufficiently random order, so that something like a TopX aggregation still returns useful information with high probability.
For the sampler aggregation the documents selected for the sample are the N top scoring documents (where score is defined by the query) from each shard (where N is the sample size). The diversified_sampler aggregation also selects the N top scoring documents but limits the sample to only X documents for each value of the field you select for diversification.
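To make the selection rule above concrete, here is a minimal sketch in plain Python that models it for a single shard. It is not Elasticsearch code: the functions `sampler` and `diversified_sampler` and the sample documents are hypothetical, and the parameter names `shard_size` and `max_docs_per_value` are only borrowed from the aggregation options to stand in for N and X. How Elasticsearch orders documents with equal scores is not modelled here.

```python
from collections import defaultdict


def sampler(shard_docs, shard_size):
    """Model of the sampler rule: keep the shard_size top scoring docs of one shard."""
    ranked = sorted(shard_docs, key=lambda d: d["score"], reverse=True)
    return ranked[:shard_size]


def diversified_sampler(shard_docs, shard_size, field, max_docs_per_value):
    """Same rule, but keep at most max_docs_per_value docs per value of `field`."""
    picked = []
    per_value = defaultdict(int)  # how many docs were kept for each field value
    for doc in sorted(shard_docs, key=lambda d: d["score"], reverse=True):
        if len(picked) >= shard_size:
            break
        if per_value[doc[field]] >= max_docs_per_value:
            continue  # this value already reached its quota, skip the doc
        picked.append(doc)
        per_value[doc[field]] += 1
    return picked


# Hypothetical documents on one shard; "score" is whatever the query produced.
shard = [
    {"id": 1, "score": 9.0, "author": "a"},
    {"id": 2, "score": 8.0, "author": "a"},
    {"id": 3, "score": 7.0, "author": "b"},
    {"id": 4, "score": 6.0, "author": "a"},
    {"id": 5, "score": 5.0, "author": "c"},
]

plain_ids = [d["id"] for d in sampler(shard, 3)]
diverse_ids = [d["id"] for d in diversified_sampler(shard, 3, "author", 1)]
print(plain_ids)    # [1, 2, 3]
print(diverse_ids)  # [1, 3, 5]
```

In this model the sample is entirely determined by the scores, so the sample is only as random as the scoring of the query that feeds it.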
I quickly tested it and it seems to work. However, it has quite a performance impact, which cancels out the benefit we were hoping to get from the SamplingAggregation in the first place: the request then takes at least as long as a full aggregation without sampling.