How to know the total number of aggregation result buckets

I have a multi-level aggregation query that results in nested buckets. To retrieve all the results, I currently take the total doc hits number and divide it by partition_size = 1000, so the results are hashed across x partitions (x = total[hits] / partition_size), and then I retrieve the aggregation results from each partition id. We use the include clause of the terms aggregation to do the partitioning.

We think this is a safe approach: deriving the total number of partitions from total[hits] with a fixed partition_size. We do this because we do not know of a way to accurately determine beforehand how many aggregation buckets the result will contain.
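For reference, here is a minimal sketch of the kind of request I mean, written with the Python client. The index name (my-index) and the fields (user_id, event_type) are just placeholders for our real mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

PARTITION_SIZE = 1000

def fetch_partition(partition, num_partitions):
    # Fetch one partition of a two-level terms aggregation. The include
    # clause hashes every term into one of num_partitions groups and keeps
    # only the terms that fall into the requested partition.
    return es.search(
        index="my-index",
        body={
            "size": 0,  # we only want aggregation buckets, not hits
            "aggs": {
                "by_user": {
                    "terms": {
                        "field": "user_id",
                        "size": PARTITION_SIZE,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                    },
                    "aggs": {
                        "by_event": {"terms": {"field": "event_type"}}
                    },
                }
            },
        },
    )
```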

Is there a better approach?

Hi.

Not sure what you mean by this. Can you share some examples of your request?
Have you ruled out the composite aggregation already?
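With composite you page through every bucket combination using after_key, rather than guessing a partition count up front. Roughly something like this (a sketch, with the same placeholder index and field names as your post):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

after_key = None
while True:
    composite = {
        "size": 1000,
        "sources": [
            {"user": {"terms": {"field": "user_id"}}},
            {"event": {"terms": {"field": "event_type"}}},
        ],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume after the last bucket seen
    resp = es.search(
        index="my-index",
        body={"size": 0, "aggs": {"pages": {"composite": composite}}},
    )
    agg = resp["aggregations"]["pages"]
    for bucket in agg["buckets"]:
        ...  # process bucket["key"] and bucket["doc_count"]
    after_key = agg.get("after_key")
    if after_key is None:  # no more pages
        break
```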

I am not doing histograms and I do not need the results to be ordered, so I am not using the composite aggregation. My code already deals with the fact that the results will come out of order.

I am using the partitioning feature provided in the 'include' clause of the 'terms' aggregation.
In a multi-level aggregation query, suppose the search hits 20K docs; I then derive num_partitions as 20 and retrieve the results from partition ids 0 through 19, as in the sketch below.
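In code, the derivation looks roughly like this (a sketch reusing the fetch_partition helper, PARTITION_SIZE, and the es client from my earlier snippet):

```python
import math

# Derive the partition count from the hit count, then sweep every
# partition id (0 .. num_partitions - 1). Index name is a placeholder.
count = es.count(index="my-index")["count"]                 # e.g. 20000
num_partitions = max(1, math.ceil(count / PARTITION_SIZE))  # e.g. 20

buckets = []
for partition in range(num_partitions):
    resp = fetch_partition(partition, num_partitions)
    buckets.extend(resp["aggregations"]["by_user"]["buckets"])
```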

But with this approach it is possible that some of the partitions have no results at all, since the hits count does not indicate how many nested buckets are actually generated in the result.

So say the total number of results is 22 and I spread them across 11 partitions; it is possible that only 2 out of the 11 partitions have actual results accounting for those 22 docs.

My bad - I thought for a moment you were listing raw terms in an include array. I'd forgotten partition numbers were also passed in the include clause.

I think you need to use the cardinality aggregation to figure out how many partitions are required, rather than the number of hits. The cardinality aggregation should tell you roughly how many unique values there are for a field.
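Roughly like this (a sketch, reusing the es client and PARTITION_SIZE from earlier in the thread, with the same placeholder index and field):

```python
import math

# Size the partitions from the (approximate) number of unique terms
# rather than from the hit count.
resp = es.search(
    index="my-index",
    body={
        "size": 0,
        "aggs": {
            "unique_users": {"cardinality": {"field": "user_id"}}
        },
    },
)
unique_terms = resp["aggregations"]["unique_users"]["value"]
num_partitions = max(1, math.ceil(unique_terms / PARTITION_SIZE))
```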

I was initially using a cardinality aggregation to count the unique values of the top field used at the first level of the aggregation. But for some use cases this approach will not consider all docs, and I will miss buckets.

So I am using the total-hits approach to be safe. Of course there are some blank fires, where some partitions do not have any results, but otherwise you are blessing this approach, right?

Safe in terms of not blowing up memory or safe in terms of producing accurate results?

If you're doing a multi-level agg this complicates matters, because a simple cardinality agg won't give us a sense of how balanced the tree is. Some branches could have significantly more values than others, so we can't predict the number of leaves we might find in any given branch. That makes picking partition sizes for the different levels more complex.

Generally speaking, if you set size to return N buckets and the number of buckets returned is less than N, then you know that your partition size is safe (accuracy-wise): all terms have been fully considered in that request. If the number of buckets returned is exactly N, then you should pay attention to the reported error margins; we would likely be dropping some terms as part of the computation, meaning some data may not have been considered in the reduction.
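In other words, a check along these lines after each partitioned request (a sketch, again reusing the fetch_partition helper and PARTITION_SIZE from earlier in the thread):

```python
resp = fetch_partition(0, num_partitions)
agg = resp["aggregations"]["by_user"]

if len(agg["buckets"]) < PARTITION_SIZE:
    # Fewer buckets than the requested size: every term in this
    # partition was fully considered.
    print("partition fully accurate")
else:
    # Exactly `size` buckets came back: terms may have been dropped,
    # so inspect the error margins reported alongside the buckets.
    print("doc_count_error_upper_bound:", agg["doc_count_error_upper_bound"])
    print("sum_other_doc_count:", agg["sum_other_doc_count"])
```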

Safe in terms of producing accurate results.

What is your recommendation on specifying "doc_count_error_upper_bound": 0 in the top-level parent of the aggregation query?

This is a value that is reported by the engine, not provided by you.
It tells you at most how many docs might have been ignored for the returned terms. If you're interested in sums or averages of values held on those docs, then of course your figures will be off by the amounts on the missing docs. By focusing the analysis of each request on a suitably reduced set of terms (using partitioning), you should see the doc_count_error_upper_bound result head towards zero.

Agreed. Thanks.
