Elasticsearch terms aggregation with partitions does not return equal buckets

I have an ES index and I fetch data using a terms aggregation with partitioning.

I am using the cardinality aggregation to get the total number of buckets, then calculating num_partitions from that count using the default size value of 10.

E.g. the cardinality aggregation returns a count of 3655.

num_partitions = ceil(3655 / 10) = 366

When I execute the terms aggregation with num_partitions = 366 and size = 10, it returns the following number of buckets per request: 6, 2, 10, 8, 10, 10, 9, ..., 10.
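
Roughly, here is how I issue the requests (a minimal Python sketch using the requests library; the endpoint, index, and field names are placeholders):

```python
import math
import requests

ES = "http://localhost:9200"   # placeholder cluster URL
INDEX = "my-index"             # placeholder index name
FIELD = "my_field"             # placeholder keyword field

# Step 1: estimate the number of distinct terms with a cardinality agg.
body = {"size": 0, "aggs": {"distinct": {"cardinality": {"field": FIELD}}}}
resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
total_terms = resp["aggregations"]["distinct"]["value"]   # e.g. 3655

# Step 2: derive num_partitions from the page size I want.
size = 10
num_partitions = math.ceil(total_terms / size)            # 3655 -> 366

# Step 3: fetch every partition with a partitioned terms aggregation.
buckets = []
for p in range(num_partitions):
    body = {
        "size": 0,
        "aggs": {
            "terms_page": {
                "terms": {
                    "field": FIELD,
                    "include": {"partition": p, "num_partitions": num_partitions},
                    "size": size,
                }
            }
        },
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    buckets.extend(resp["aggregations"]["terms_page"]["buckets"])

print(len(buckets))   # comes out lower than total_terms
```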

The sum of buckets across all partitions is less than 3655, and even after the last partition many terms are missing.

I would expect it to go like this: 10, 10, 10, 10, ..., 5.

So, the question is:

  • Why does each partition not contain 10 buckets, as per size?

This slight unevenness is to be expected: terms are assigned to partitions by hashing, and the widely used hash partitioning technique does not yield exactly equal-sized partitions.
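
To illustrate (a toy Python sketch, not Elasticsearch's actual hash function): hashing 3655 distinct terms into 366 partitions produces partition sizes scattered around the mean of 10, not exactly 10 each.

```python
import hashlib
from collections import Counter

# Toy model of hash partitioning (NOT Elasticsearch's actual hash):
# assign 3655 distinct terms to 366 partitions by hashing each term.
def partition_of(term: str, num_partitions: int) -> int:
    digest = hashlib.md5(term.encode()).hexdigest()
    return int(digest, 16) % num_partitions

sizes = Counter(partition_of(f"term-{i}", 366) for i in range(3655))
# Partition sizes scatter around the mean of 10 (very roughly 2 to 20)
# rather than coming out as exactly 10 each.
print(sorted(sizes.values()))
```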

If the results across all partitions don't add up, then that would be a bug, but I doubt that is the case. You may have overlooked the error bounds reported with some of the terms agg results.

Hi @Mark_Harwood, I think this is a bug. I'm facing the same issue with every index. Below are some more cases that return a different number of buckets for the same index.

The cardinality aggregation returns a count of 500.

Case 1
size: 500
num_partitions: 1
total buckets returned: 430

Case 2
size: 250
num_partitions: 2
total buckets returned: 217 + 213 = 430

Case 3
size: 100
num_partitions: 5
total buckets returned: 72 + 79 + 80 + 100 + 94 = 425

Case 4
size: 50
num_partitions: 10
total buckets returned: 37 + 34 + 39 + 50 + 48 + 35 + 45 + 41 + 48 + 46 = 423

And as I increase the number of partitions, the sum of buckets across all partitions gets smaller and smaller.

Note that in cases 3 and 4 there are partitions returning exactly the number of buckets requested in your 'size' parameter.
That is a strong indication that there are more than N buckets to be returned, but you only asked to see N.
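
One quick way to verify this (a Python sketch; `terms_page` is just a placeholder aggregation name): a truncated partition reports the documents belonging to the buckets it dropped in `sum_other_doc_count`.

```python
# Shape of a real terms-agg response; the values here are illustrative.
resp = {
    "aggregations": {
        "terms_page": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 42,
            "buckets": [{"key": "a", "doc_count": 7}],
        }
    }
}

# Elasticsearch sums the doc counts of buckets it did NOT return into
# sum_other_doc_count, so a non-zero value means 'size' was too small
# for this partition and some buckets were dropped.
agg = resp["aggregations"]["terms_page"]
if agg["sum_other_doc_count"] > 0:
    print("partition truncated: more buckets exist than 'size' allowed")
```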

Okay, I understand. But I originally calculated the number of partitions from the total count and the size precisely so that it would fit all the data. Given the total count and a size, can you suggest the best way to calculate the number of partitions so that no bucket is skipped?

Calculate the number of partitions so that you're working with a manageable subset of the data for the analysis you want to do.
Too small a partition number = many unique keys per partition = potential memory overload or inaccuracies
Too large a partition number = many client requests required = slow

Whatever partition number you settle on, increase the 'size' used in the request. E.g. if you think a partition will produce roughly 1000 keys, set the size to 1500 to allow for the unevenness we talked about previously. Setting it to exactly 1000 would be a mistake, for the reason I outlined in my last comment.
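
Putting those two rules together (a minimal sketch; the 1000-keys-per-partition target and the 1.5 headroom factor are illustrative choices, not fixed rules):

```python
import math

total_terms = 3655         # cardinality estimate (itself approximate)
keys_per_partition = 1000  # manageable chunk per request; your call
headroom = 1.5             # over-request to absorb hash unevenness

num_partitions = max(1, math.ceil(total_terms / keys_per_partition))
size = math.ceil(total_terms / num_partitions * headroom)
print(num_partitions, size)   # -> 4 partitions, size 1371 per request
```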

