Significant_terms aggregation with sampling

Hi there,

I am looking to extract the top 100 most significant terms from a very large index, specifically terms whose frequency has increased most significantly between some defined "background set" and "foreground set" of documents. The queries also need to be quite fast, ideally under a couple of seconds. For this reason, I think I will need to sample rather than consider all documents.

I thought the solution was to use the "sampler" aggregation with "significant_terms" as a sub-aggregation. However, it seems that sampler only samples documents in the "foreground set", not documents in the "background set" specified by the "background_filter" parameter. Since the background set isn't sampled, it remains very large, so the overall query is still quite slow.
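For reference, this is roughly the shape of the query I've been running (index name, field names, and the time-range filters are just placeholders for my real foreground query and background_filter):

```
POST /my-index/_search
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1d" } } },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 1000 },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "size": 100,
            "background_filter": { "range": { "timestamp": { "gte": "now-30d" } } }
          }
        }
      }
    }
  }
}
```

The sampler caps the documents fed into significant_terms at the top shard_size hits per shard, but the background_filter still matches the full 30-day window.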

I'm wondering if there is any workaround to this issue, or perhaps some other solution entirely.

Sampling typically improves both performance and quality: it focuses on the highest-ranking documents rather than the long tail of many potentially low-quality docs.
The main cost of significance scoring is looking up background frequencies for the words found in the foreground. With sampling, fewer docs are considered and therefore there are fewer unique words to look up. Searches that match millions or billions of docs will need to use sampling.

You can also reduce the number of words checked for background frequencies by setting "shard_min_doc_count" to something like 3 or 4. Only words that occur at least that many times in the foreground set will be looked up in the background set.
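As a rough sketch (field names and thresholds are placeholders), the setting goes on the significant_terms aggregation itself, alongside the sampler's shard_size, e.g. as the "aggs" section of a search request:

```
"aggs": {
  "sample": {
    "sampler": { "shard_size": 200 },
    "aggs": {
      "keywords": {
        "significant_terms": {
          "field": "tags",
          "size": 100,
          "shard_min_doc_count": 3,
          "background_filter": { "range": { "timestamp": { "gte": "now-30d" } } }
        }
      }
    }
  }
}
```

Lowering shard_size and raising shard_min_doc_count are the two main knobs: the first reduces how many foreground docs are examined, the second reduces how many candidate terms incur a background-frequency lookup.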
