While reading the documentation for the Terms Aggregation, I came across the fact that its results are not always accurate, but that we can increase the size to get more accurate results.
I know:

- How query-then-fetch works.
- How top terms are calculated at each shard (shard_size) and then merged at the coordinating node (size).
- What doc_count_error_upper_bound means, and how it can indicate that the top results may contain an error and that we need to increase the size.
But is there a mathematical approach, or any other way, to determine the correct size to ask for once we get inaccurate results the first time?
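For concreteness, this is the kind of request I mean (a minimal sketch assuming the 8.x elasticsearch-py client; the index name "products" and field "brand" are made-up placeholders). Setting show_term_doc_count_error: true makes Elasticsearch report a per-bucket error bound in addition to the aggregation-level one:

```python
# Hedged sketch of a terms aggregation with explicit size / shard_size,
# assuming the 8.x elasticsearch-py client and hypothetical index/field names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="products",
    size=0,  # we only care about the aggregation, not the hits
    aggs={
        "top_brands": {
            "terms": {
                "field": "brand",
                "size": 10,        # number of buckets returned to the client
                "shard_size": 25,  # number of buckets fetched from each shard
                # report a per-bucket worst-case error as well
                "show_term_doc_count_error": True,
            }
        }
    },
)

agg = resp["aggregations"]["top_brands"]
# Aggregation-level worst case: sum over shards of the last returned count.
print(agg["doc_count_error_upper_bound"])
for bucket in agg["buckets"]:
    # Per-bucket worst case: sum of the last returned counts from the
    # shards that did not return this term at all.
    print(bucket["key"], bucket["doc_count"], bucket["doc_count_error_upper_bound"])
```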
Suppose you have 3 primary shards with size=10 and shard_size=25, and the 10th term 'A' in the result was contained in the results of only 2 of the primary shards.
If the count of the 25th term in the shard not returning 'A' is N, the true count of 'A' in that shard could be any value between 0 and N, so the worst-case error contributed by that shard is N. In general, a term's doc_count_error_upper_bound is the sum of these thresholds (the doc count of the last term each shard returned) over all shards that did not return a bucket for that term.
The response contains no information about where 'A' actually ranks within those shards, so there is no way to calculate the exact size or shard_size needed for an accurate result. The only way to guarantee zero error is to use a shard_size at least as large as the cardinality of the field.
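A small simulation of the merge step makes the bound concrete (the per-shard term counts below are invented for illustration, with tiny size/shard_size values so the effect is visible):

```python
# Hedged sketch: simulate the coordinator merging per-shard top terms.
# All shard contents are made up; SIZE/SHARD_SIZE are deliberately tiny.
SIZE, SHARD_SIZE = 2, 3

shards = [
    {"A": 50, "B": 40, "C": 30, "D": 25},
    {"A": 45, "C": 44, "D": 43, "B": 5},
    {"B": 60, "D": 55, "C": 50, "A": 49},
]

# Each shard returns only its top SHARD_SIZE terms.
shard_tops = [
    dict(sorted(s.items(), key=lambda kv: -kv[1])[:SHARD_SIZE]) for s in shards
]

# The coordinator sums the partial counts it received.
merged = {}
for top in shard_tops:
    for term, count in top.items():
        merged[term] = merged.get(term, 0) + count

# Per-term worst case: every shard that did NOT return the term could
# still hold up to the count of the last term it did return.
for term, count in sorted(merged.items(), key=lambda kv: -kv[1])[:SIZE]:
    error = sum(min(top.values()) for top in shard_tops if term not in top)
    print(term, count, error)
```

Running this prints C with count 124 and error 0 (all shards returned it) and B with count 100 and error 43: shard 2 did not return B, whose last returned term had count 43. B's true total is 105, so the reported count undercounts by 5, safely within the bound of 43, but nothing in the response tells the coordinator whether the real error is 0, 5, or 43.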