Elasticsearch terms aggregation with partitions does not return equal buckets

I have an ES index and I fetch data using a terms aggregation with partitions.

I am using the cardinality aggregation to get the total number of buckets and calculating num_partitions from it, using the default size value of 10.

E.g. the cardinality aggregation returns a count of 3655.

num_partitions = 3655 / 10 ≈ 366
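
For reference, here is a minimal sketch of this step using the official Python client (the endpoint, index, and field names are placeholders, and the `body=` call style assumes an elasticsearch-py 7.x-era client):

```python
from math import ceil

from elasticsearch import Elasticsearch  # official Python client, assumed

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Step 1: estimate how many distinct terms exist. Note the cardinality
# aggregation is approximate, so this total is itself an estimate.
resp = es.search(
    index="my-index",  # placeholder index name
    body={
        "size": 0,
        "aggs": {"unique_terms": {"cardinality": {"field": "my_field"}}},
    },
)
total_terms = resp["aggregations"]["unique_terms"]["value"]  # e.g. 3655

# Step 2: derive the partition count from the page size (default 10).
page_size = 10
num_partitions = ceil(total_terms / page_size)  # 3655 / 10 -> 366
```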

When I execute the terms aggregation with num_partitions = 366 and size = 10, the requests return the following numbers of buckets: 6, 2, 10, 8, 10, 10, 9, ..., 10.

The sum of all buckets is less than 3655, and even after the last partition, many terms are skipped.

I would expect it to go like this: 10, 10, 10, 10, ..., 5.
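
Here is roughly how I page through the partitions (a sketch continuing the one above; the agg name `terms_page` is made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
page_size = 10
num_partitions = 366  # from the cardinality step above

bucket_counts = []
for partition in range(num_partitions):
    resp = es.search(
        index="my-index",  # placeholder index name
        body={
            "size": 0,
            "aggs": {
                "terms_page": {
                    "terms": {
                        "field": "my_field",
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": page_size,
                    }
                }
            },
        },
    )
    # Count the buckets returned for this partition.
    bucket_counts.append(len(resp["aggregations"]["terms_page"]["buckets"]))

print(bucket_counts)       # observed: 6, 2, 10, 8, 10, 10, 9, ..., 10
print(sum(bucket_counts))  # falls short of the cardinality estimate
```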

So, the question is:

  • Why does each partition not include 10 buckets, as per size?

This slight unevenness is to be expected given the hash partitioning technique used to assign terms to partitions.
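
To see why, here is a toy simulation. This is not Elasticsearch's actual hash function; it only illustrates the statistical spread you get from any hash-based assignment:

```python
import hashlib
from collections import Counter

# Toy model: route each term to a partition by hash(term) % num_partitions.
num_partitions = 366
terms = [f"term_{i}" for i in range(3655)]

per_partition = Counter(
    int(hashlib.md5(t.encode()).hexdigest(), 16) % num_partitions
    for t in terms
)
# The mean is ~10 terms per partition, but individual partitions vary
# around it -- the unevenness seen above.
print(min(per_partition.values()), max(per_partition.values()))
```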

If the results across all partitions don’t add up then that would be a bug - but I doubt that is the case. You may have overlooked the error bounds that are reported with some of the terms agg results.
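
For example, each partition response from a terms agg carries accuracy metadata you can inspect (a sketch; the agg name matches the earlier snippets):

```python
def check_accuracy(resp, agg_name="terms_page"):
    """Print the accuracy metadata a terms aggregation response reports."""
    agg = resp["aggregations"][agg_name]
    # Worst-case error in any returned bucket's doc count:
    print("doc_count_error_upper_bound:", agg["doc_count_error_upper_bound"])
    # Total doc count of terms that were cut off by 'size':
    print("sum_other_doc_count:", agg["sum_other_doc_count"])
```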

Hi @Mark_Harwood, I think this is a bug. I am facing the same issue for every index. Below are some more cases that return different numbers of buckets for the same index.

The cardinality aggregation returns a count of 500.

Case 1
size: 500
num_partitions: 1
return total buckets: 430

Case 2
size: 250
num_partitions: 2
return total buckets: 217 + 213 = 430

Case 3
size: 100
num_partitions: 5
return total buckets: 72+79+80+100+94 = 425

Case 4
size: 50
num_partitions: 10
return total buckets: 37+34+39+50+48+35+45+41+48+46 = 423

And as I increase the number of partitions, the sum of buckets across all partitions drops even further.

Note that in cases 3 and 4 there are partitions returning exactly the number of buckets requested in your ‘size’ parameter.
That is a strong indication that there are more than N buckets to be returned but you only asked to see N.
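
A quick way to spot this in each partition response (a sketch, reusing the made-up agg name from the earlier snippets):

```python
def partition_truncated(resp, requested_size, agg_name="terms_page"):
    """Heuristic: a partition that returns exactly `requested_size` buckets
    and reports leftover docs was almost certainly truncated."""
    agg = resp["aggregations"][agg_name]
    return len(agg["buckets"]) == requested_size and agg["sum_other_doc_count"] > 0
```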

Okay, I understand. But I first calculated the number of partitions based on the total count and size so that it would fit the total data. Can you suggest the best way to calculate the number of partitions so that no bucket is skipped, given that I have the total count and size?

Calculate the number of partitions so that you’re working with a manageable subset of the data for the analysis you want to do.
Too small a partition number = many unique keys per partition = potential overload of memory or inaccuracies
Too large a partition number = many client requests required = slow

Whatever partition number you settle on, up the ‘size’ used in the request. E.g. if you think the partitioning will produce roughly 1000 keys per partition, set the size to 1500 to allow for the unevenness we talked about previously. Setting it to exactly 1000 would be a mistake for the reason I outlined in my last comment.
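
Putting that together, a sizing sketch (the 1000 and 1500 figures are just the example numbers from this reply, not official recommendations):

```python
from math import ceil

total_terms = 3655           # approximate count from the cardinality agg
keys_per_partition = 1000    # target workload per request -- your choice

# Fewer partitions = more keys per request; more partitions = more requests.
num_partitions = ceil(total_terms / keys_per_partition)

# Oversize the request to absorb hash-partitioning unevenness, e.g. 1.5x.
size = ceil(keys_per_partition * 1.5)  # 1000 expected keys -> size 1500
```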

