I have an ES index and I fetch data using a terms aggregation with partitioning.
I use the cardinality aggregation to get the total number of buckets and calculate num_partitions from it, using the default size of 10.
E.g. the cardinality aggregation returns a count of 3655.
num_partitions = ceil(3655 / 10) = 366
When I execute the terms aggregation with num_partitions = 366 and size = 10, the requests return the following numbers of buckets: 6, 2, 10, 8, 10, 10, 9, ..., 10.
The sum of all buckets is less than 3655, and even after the last partition many terms have been skipped.
I would expect it to go like this: 10, 10, 10, 10, ..., 5.
So, the question is:
Why does each partition not contain 10 buckets, as per the size parameter?
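For reference, here is roughly how I issue the two requests. This is a minimal Python sketch; the host, index name (my-index), and field name (my_field) are placeholders:

```python
import math
import requests

ES = "http://localhost:9200/my-index/_search"  # placeholder host/index

# Step 1: approximate distinct-term count via the cardinality agg.
card = requests.post(ES, json={
    "size": 0,
    "aggs": {"total": {"cardinality": {"field": "my_field"}}},
}).json()["aggregations"]["total"]["value"]

size = 10
num_partitions = math.ceil(card / size)  # e.g. ceil(3655 / 10) = 366

# Step 2: one terms-agg request per partition.
for partition in range(num_partitions):
    resp = requests.post(ES, json={
        "size": 0,
        "aggs": {"page": {"terms": {
            "field": "my_field",
            "size": size,
            "include": {"partition": partition,
                        "num_partitions": num_partitions},
        }}},
    }).json()
    # Observed bucket counts per partition: 6, 2, 10, 8, 10, ...
    print(partition, len(resp["aggregations"]["page"]["buckets"]))
```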
This slight unevenness is to be expected: terms are assigned to partitions using the widely used hash partitioning technique, which does not split them perfectly evenly.
If the results across all partitions don't add up, then that would be a bug - but I doubt that is the case. You may have overlooked the error bounds reported with some of the terms agg results.
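To illustrate, here is a standalone Python sketch that hashes 3655 synthetic terms into 366 partitions. The hash is purely illustrative (it is not the hash Elasticsearch uses internally), but the unevenness is the same in kind:

```python
import hashlib
from collections import Counter

# Assign 3655 synthetic terms to 366 partitions by hashing, mimicking
# how the terms agg's "include" partitioning splits the key space.
num_terms, num_partitions = 3655, 366
per_partition = Counter(
    int(hashlib.md5(f"term-{i}".encode()).hexdigest(), 16) % num_partitions
    for i in range(num_terms)
)
print(min(per_partition.values()), max(per_partition.values()))
# Prints something like 2 and 20: the average is ~10, but individual
# partitions can hold noticeably fewer or more terms than that.
```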
Hi @Mark_Harwood, I think this is a bug. I'm facing the same issue for every index. Below are some more cases that return different numbers of buckets for the same index.
Cardinality aggregation returns count: 500
Case 1
size: 500
num_partitions: 1
returned total buckets: 430
Case 2
size: 250
num_partitions: 2
returned total buckets: 217 + 213 = 430
Case 3
size: 100
num_partitions: 5
returned total buckets: 72 + 79 + 80 + 100 + 94 = 425
Case 4
size: 50
num_partitions: 10
returned total buckets: 37 + 34 + 39 + 50 + 48 + 35 + 45 + 41 + 48 + 46 = 423
And as I increase the number of partitions, the total number of buckets across all partitions falls even further short.
Note that in cases 3 and 4 there are partitions returning exactly the number of buckets requested in your "size" parameter.
That is a strong indication that there are > N buckets to be returned, but you only asked to see N.
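You can convince yourself of this with a simulation of Case 4's shape: 500 distinct keys, 10 partitions, size 50, with each partition clipped at the requested size. The hash is again a stand-in for illustration only:

```python
import hashlib
from collections import Counter

# 500 distinct keys hashed into 10 partitions (illustrative hash only).
per_partition = Counter(
    int(hashlib.md5(f"key-{i}".encode()).hexdigest(), 16) % 10
    for i in range(500)
)
size = 50
print(sorted(per_partition.values()))
# Uneven, e.g. [39, 42, 46, ..., 58, 61]: some partitions exceed 50.
print(sum(min(n, size) for n in per_partition.values()))
# Less than 500: any partition holding more than "size" keys is
# silently clipped to exactly "size" buckets, which is why some
# partitions in Cases 3 and 4 return exactly the requested size.
```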
Okay, I understand. But I calculated the number of partitions from the total count and the size precisely so that all the data would fit. Given a total count and a size, what is the best way to calculate the number of partitions so that no buckets are skipped?
Calculate the number of partitions so that you're working with a manageable subset of the data for the analysis you want to do.
Too small a partition number = many unique keys per partition = potential overload of memory or inaccuracies
Too large a partition number = many client requests required = slow
Whatever partition number you settle on, up the "size" used in the request. E.g. if you think each partition will produce roughly 1000 keys, set the size to 1500 to allow for the unevenness we talked about previously. Setting it to exactly 1000 would be a mistake for the reason I outlined in my last comment.
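Putting rough numbers on that, one way to pick the two knobs (the 1.5x headroom is a rule of thumb from the advice above, not a guarantee):

```python
import math

cardinality = 3655   # from the cardinality agg (itself an estimate)
target_page = 1000   # roughly how many buckets you want per request
num_partitions = max(1, math.ceil(cardinality / target_page))
# Oversize the request so uneven partitions still fit in one page.
size = math.ceil(1.5 * cardinality / num_partitions)
print(num_partitions, size)  # 4 partitions, size 1371
# If a partition still returns exactly "size" buckets, raise "size"
# (or num_partitions) and retry: that is the truncation signal noted
# earlier in the thread.
```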