I have a set of terms and want to know which of them are present in the search results. I currently use a terms aggregation for this.
Is there a way to limit the number of documents counted per bucket in a terms aggregation in order to improve performance?
All I want to know is whether at least one document contains each term. I don't need the number of documents per term.
Or maybe there is another more efficient way to solve this problem?
Thank you.
What's the business question you're trying to answer with this request?
If your goal is to count the number of unique terms, use the cardinality aggregation.
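As a rough sketch of what such a request body could look like, here it is built as a plain Python dict so it can be sent with whatever client you use. The field name `tags`, the aggregation name `unique_terms`, and the helper `cardinality_request` are placeholders for illustration, not anything from your mapping. Note that the cardinality aggregation returns an approximate count.

```python
import json


def cardinality_request(field, query=None):
    """Build a search body that asks for the approximate number of
    unique terms in `field` among the docs matching `query`."""
    return {
        "size": 0,  # skip returning hits; only the aggregation result is needed
        "query": query or {"match_all": {}},
        "aggs": {
            # "unique_terms" is an arbitrary name chosen for this example
            "unique_terms": {"cardinality": {"field": field}}
        },
    }


# "tags" is a hypothetical keyword field
body = cardinality_request("tags")
print(json.dumps(body, indent=2))
```

This only tells you how many distinct terms matched, not which ones, so it fits the "count unique terms" goal rather than the "which terms are present" goal.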
If you want the most popular terms from a large set of unique terms, just faster, then skipping the counting of some terms may lead to inaccuracies in which subset of terms is selected for the final result.
If the number of unique terms in the index is small (so you always return the full set rather than just the top N), we would need to terminate the search's collection of docs only after a defined number of terms had been discovered. But there is no way for you to tell us what that expected number is (and you may not know what to expect yourself). We cannot know that there isn't one more term to be found at the end of a very long stream of docs matching the query, so it's hard to add a shortcut.
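To make that last point concrete, here is a toy Python simulation. It is not Elasticsearch code: `collect_terms` and the sample docs are made up purely for illustration. It scans a stream of docs and stops early once a guessed number of distinct terms has been seen.

```python
def collect_terms(docs, expected_terms=None):
    """Scan docs in order and collect the distinct terms seen.

    If `expected_terms` is given, stop as soon as that many distinct
    terms have been found. Returns (terms_found, docs_scanned).
    """
    seen = set()
    scanned = 0
    for doc_terms in docs:
        scanned += 1
        seen.update(doc_terms)
        if expected_terms is not None and len(seen) >= expected_terms:
            break
    return seen, scanned


# 3000 docs that only contain "red" or "blue", then one rare "green" doc last
docs = [["red"], ["blue"], ["red"]] * 1000 + [["green"]]

# No shortcut: every doc is visited and all three terms are found
full, full_scanned = collect_terms(docs)

# Shortcut with a wrong guess of 2 terms: stops after 2 docs, misses "green"
guess, guess_scanned = collect_terms(docs, expected_terms=2)

print(sorted(full), full_scanned)
print(sorted(guess), guess_scanned)
```

The shortcut is only safe when the guess equals the true number of distinct terms among the matching docs, which is exactly the number the request is trying to discover.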