I am looking to extract the top 100 most significant terms from a very large index: specifically, the terms whose frequency has increased most significantly between a defined "background set" and a "foreground set" of documents. The queries also need to be fast, ideally under a couple of seconds, so I think I will need to sample rather than consider all documents.
I thought the solution was to use the "sampler" aggregation with "significant_terms" as a sub-aggregation. However, it seems that sampler only samples the documents in the "foreground set", not the documents in the "background set" specified by the "background_filter" parameter. Because the background set is not sampled, it is still very large, and the overall query is still quite slow.
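For reference, here is a minimal sketch of the kind of request described above, built as a plain Python dict so the nesting is easy to see. The helper name, the `body` text field, the `timestamp` range filters, and the `shard_size` value are all hypothetical placeholders chosen for illustration. Only the `sampler`, `significant_terms`, `shard_size`, `size`, and `background_filter` names come from the Elasticsearch aggregation DSL.

```python
import json


def build_sampled_significant_terms_query(foreground_query, background_filter,
                                          field, shard_size=1000, size=100):
    """Build a search body: sampler aggregation wrapping significant_terms.

    All argument values are supplied by the caller; nothing here talks to a
    cluster. The sampler limits how many foreground documents per shard are
    passed to the sub-aggregation.
    """
    return {
        "size": 0,  # return aggregation results only, no hits
        "query": foreground_query,  # defines the foreground set
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_terms": {
                        "significant_terms": {
                            "field": field,
                            "size": size,
                            # background statistics are computed over this filter
                            "background_filter": background_filter,
                        }
                    }
                },
            }
        },
    }


# Hypothetical example: last week as foreground, the previous year as background.
query_body = build_sampled_significant_terms_query(
    foreground_query={"range": {"timestamp": {"gte": "now-7d/d"}}},
    background_filter={"range": {"timestamp": {"gte": "now-1y/d", "lt": "now-7d/d"}}},
    field="body",
)

print(json.dumps(query_body, indent=2))
```

In this shape, `shard_size` only bounds the documents collected for the foreground side. The `background_filter` sits inside `significant_terms` and is not wrapped by the sampler.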
I'm wondering if there is any workaround for this issue, or perhaps some other solution entirely.