How do I determine the correct size for a terms aggregation so that it produces accurate results?

While reading the documentation for the terms aggregation, I came across the fact that its results are not always accurate, but that we can increase the size parameter to get more accurate results.

I know:

  1. How Query-Then-Fetch works.
  2. How the top terms are calculated on each shard (shard_size) and then merged on the coordinating node (size); see the toy simulation after this list.
  3. What doc_count_error_upper_bound means, and how it indicates that the top results may contain an error and that we need to increase the size.
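To make sure my mental model of point 2 is right, here is a toy Python simulation of the shard-level truncation and the coordinator merge. The per-shard counts are made up, and this is only a sketch of the idea, not Elasticsearch's actual implementation:

```python
from collections import Counter

# Made-up per-shard term counts (not from a real index).
shards = [
    Counter({"a": 50, "b": 40, "c": 30, "d": 25}),
    Counter({"a": 10, "b": 45, "d": 35, "c": 5}),
    Counter({"c": 60, "d": 20, "a": 15, "b": 2}),
]

SHARD_SIZE = 2  # each shard returns only its top-2 terms
SIZE = 2        # the coordinator keeps the top-2 merged terms

# Step 1: each shard independently truncates to its top `shard_size` terms.
shard_tops = [dict(c.most_common(SHARD_SIZE)) for c in shards]

# Step 2: the coordinating node can only sum what the shards returned.
merged = Counter()
for top in shard_tops:
    merged.update(top)
print("approximate:", merged.most_common(SIZE))

# Ground truth: counting over all shards with no truncation.
truth = Counter()
for c in shards:
    truth.update(c)
print("exact:      ", truth.most_common(SIZE))
```

With these numbers the approximate top-2 is [('b', 85), ('c', 60)] while the exact top-2 is [('c', 95), ('b', 87)]: both the counts and the order are wrong, because each shard threw away terms before the merge.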

But is there a mathematical approach, or any other way, to determine the correct size we should ask for once we get inaccurate results the first time?

Maybe not.

Suppose you have 3 primary shards with size=10 and shard_size=25, and the 10th term in the result, 'A', was returned by only 2 of the primary shards.
If the count of the 25th term in the shard that did not return 'A' is N, then the true count of 'A' in that shard can be any value between 0 and N, so doc_count_error_upper_bound is N (or N-1; I'm not sure). In general, doc_count_error_upper_bound is the sum of the thresholds (the count of the last returned term) of the shards that did not return a bucket for that specific term.

The response contains no information about the rank of 'A' within such shards, so there is no way to calculate the size or shard_size needed to guarantee an accurate result. The only way to guarantee zero error is to use a shard_size at least as large as the terms cardinality.
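To make the rule concrete, here is a small Python sketch of the bound described above: sum the count of the last returned term over every shard whose response does not contain the term at all. The per-shard responses are invented for illustration:

```python
# Each shard returns its top `shard_size` terms, sorted by descending count
# (invented numbers; shard 3 does not return 'A' at all).
shard_responses = [
    [("A", 30), ("B", 28), ("C", 25)],  # shard 1: returns 'A'
    [("B", 40), ("A", 22), ("D", 20)],  # shard 2: returns 'A'
    [("C", 50), ("D", 35), ("B", 12)],  # shard 3: does NOT return 'A'
]

def doc_count_error_upper_bound(term, responses):
    """Sum the threshold (the count of the last, i.e. smallest, returned
    term) over every shard whose response does not contain `term`."""
    bound = 0
    for resp in responses:
        if term not in {t for t, _ in resp}:
            bound += resp[-1][1]  # that shard's threshold
    return bound

# Shard 3's threshold is 12, so 'A' could be hiding up to 12 docs there.
print(doc_count_error_upper_bound("A", shard_responses))  # -> 12
```

And crucially, nothing in the response tells you where 'A' actually ranks inside shard 3, so you cannot work backwards to the exact shard_size that would have been sufficient.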

@Tomo_M Thanks for replying!!

The only way to guarantee zero error is to use a shard_size at least as large as the terms cardinality.

Here, what do you mean by terms cardinality? Is it the count of unique terms, or the count of all the terms?

I meant the count of unique terms.
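If you want to follow that route in practice, you can estimate the number of unique terms with a cardinality aggregation first and then set shard_size above it. A rough sketch, assuming the 8.x Python client and hypothetical index/field names (my-index, category); note that the cardinality aggregation is itself approximate (HyperLogLog-based), so pad the estimate:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 1: estimate how many unique terms the field has.
resp = es.search(
    index="my-index",
    size=0,
    aggs={"n_terms": {"cardinality": {"field": "category"}}},
)
n_unique = resp["aggregations"]["n_terms"]["value"]

# Step 2: ask each shard for at least that many terms, padded to absorb
# the cardinality estimate's own error.
resp = es.search(
    index="my-index",
    size=0,
    aggs={
        "top_terms": {
            "terms": {
                "field": "category",
                "size": 10,
                "shard_size": int(n_unique * 1.1) + 10,
            }
        }
    },
)
for bucket in resp["aggregations"]["top_terms"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```

This only stays cheap while the cardinality is modest: a very large shard_size costs memory and CPU on every shard, which is exactly why the terms aggregation is approximate by default.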
