Significant_terms aggregation with sampling

colinvdh · December 23, 2022, 7:13pm

Hi there,

I am looking to extract the top 100 most significant terms from a very large index. In particular, I want the terms whose frequency has increased most significantly between some defined "background set" and "foreground set" of documents. The queries also need to be quite fast, ideally under a couple of seconds. For this reason, I think I will need to sample rather than consider all documents.

I thought the solution was to use the "sampler" aggregation, and then "significant_terms" as a sub-aggregation. However, it seems that sampler only samples the documents in the "foreground set" and not documents in the "background set" as specified by the "background_filter" parameter. Without sampling, the background set of documents is still very large, so the overall query is still quite slow.
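
For reference, here is a rough sketch of the kind of request I mean, written as a Python dict for the search body. The field names ("message_terms", "timestamp"), the date ranges, and the shard_size value are placeholders for illustration only, not my real mapping. The "sampler" wraps "significant_terms", and "background_filter" narrows the background set but is not itself sampled.

```python
import json


def build_sampled_request(field, shard_size, foreground_query,
                          background_filter, size=100):
    """Build a search body: a sampler aggregation wrapping significant_terms."""
    return {
        "size": 0,  # only aggregation results are needed, not hits
        "query": foreground_query,  # defines the foreground set
        "aggs": {
            "sample": {
                # keep only the top-scoring foreground docs on each shard
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_terms": {
                        "significant_terms": {
                            "field": field,
                            "size": size,
                            # restricts the background set; this part is not sampled
                            "background_filter": background_filter,
                        }
                    }
                },
            }
        },
    }


# All values below are illustrative assumptions.
body = build_sampled_request(
    field="message_terms",
    shard_size=500,
    foreground_query={"range": {"timestamp": {"gte": "now-1d"}}},
    background_filter={"range": {"timestamp": {"gte": "now-30d"}}},
)
print(json.dumps(body, indent=2))
```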

I'm wondering if there is any workaround to this issue, or perhaps some other solution entirely.

Mark_Harwood1 · December 23, 2022, 10:19pm

Sampling typically improves both performance and quality: it focuses on the highest-ranking documents rather than the long tail of many potentially low-quality docs.
The main cost of significance scoring is that of looking up background frequencies for words found in the foreground. With sampling there are fewer docs considered and therefore fewer unique words to look up. Searches that match millions or billions of docs will need to use sampling.

You can also reduce the number of words checked for background frequencies by setting ‘shard_min_doc_count’ to something like 3 or 4. Only words that occur at least that many times in the foreground set on a shard will be looked up in the background set.
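
As a minimal sketch of where that setting goes, here is a search body built as a Python dict. The field name "message_terms" and the shard_size of 500 are assumptions for illustration; "shard_min_doc_count" sits directly on the significant_terms aggregation, next to "field" and "size".

```python
import json


def significant_terms_clause(field, size=100, shard_min_doc_count=3):
    """Return a significant_terms aggregation that skips rare foreground terms."""
    return {
        "significant_terms": {
            "field": field,
            "size": size,
            # terms seen fewer times than this on a shard are never
            # looked up in the background set
            "shard_min_doc_count": shard_min_doc_count,
        }
    }


# Hypothetical field name and sample size, for illustration only.
request_body = {
    "size": 0,
    "aggs": {
        "sample": {
            "sampler": {"shard_size": 500},
            "aggs": {"top_terms": significant_terms_clause("message_terms")},
        }
    },
}
print(json.dumps(request_body, indent=2))
```

Raising the threshold trades recall on rare terms for fewer background lookups, so it is worth trying a couple of values against your own data.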

