I have a multi-level aggregation query that results in nested buckets. To retrieve all the results, I currently take the total doc hits count, divide it by partition_size = 1000 so the results are hashed across x partitions (x = total[hits] / partition_size), and then retrieve the aggregation results from each partition id. We are using the include clause for the partitioning approach.
We think this is a safe approach: derive the total number of partitions from total[hits] with a fixed partition_size, because we do not know of a way to accurately determine beforehand how many aggregation buckets the result will contain.
I am not doing histograms and do not need the results to be ordered, so I am not using a composite aggregation. My code handles the fact that the results come back out of order.
I am using the partitioning feature provided in the 'include' clause of the 'terms' aggregation.
In a multi-level aggregation query, suppose the search results in 20K hits; I then derive num_partitions as 20 and retrieve the results from partition ids 0 through 19.
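For reference, a minimal sketch of that flow (assuming a hypothetical index and field name, the Python requests library, and an Elasticsearch 7.x-style hits.total object):

```python
import math
import requests

ES_URL = "http://localhost:9200/my-index/_search"  # hypothetical index
PARTITION_SIZE = 1000

# Step 1: get the total hit count and derive the number of partitions from it.
resp = requests.post(ES_URL, json={"size": 0, "track_total_hits": True}).json()
total_hits = resp["hits"]["total"]["value"]
num_partitions = max(1, math.ceil(total_hits / PARTITION_SIZE))

# Step 2: run the partitioned terms aggregation once per partition id.
all_buckets = []
for partition in range(num_partitions):
    query = {
        "size": 0,
        "aggs": {
            "by_field": {
                "terms": {
                    "field": "my_field.keyword",  # hypothetical first-level field
                    "size": PARTITION_SIZE,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                },
                # nested sub-aggregations ("aggs") for the deeper levels go here
            },
        },
    }
    resp = requests.post(ES_URL, json=query).json()
    all_buckets.extend(resp["aggregations"]["by_field"]["buckets"])
```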
But with this approach it is possible that some partitions have no results at all, since the hit count does not indicate how many nested buckets the query actually generates.
So say the total results were 22 and I spread them across 11 partitions; it is possible that only 2 of the 11 partitions hold actual results accounting for the 22 docs.
My bad - I thought for a moment you were listing raw terms in an include array. I'd forgotten partition numbers were also passed in the include clause.
I think you need to use the cardinality aggregation to figure out how many partitions are required, rather than the number of hits. The cardinality aggregation should tell you roughly how many unique values there are for a field.
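A rough sketch of that idea (again with hypothetical index and field names, and keeping in mind that cardinality is an approximate count):

```python
import math
import requests

ES_URL = "http://localhost:9200/my-index/_search"  # hypothetical index
PARTITION_SIZE = 1000

# Ask for the (approximate) number of unique values of the first-level field.
card_query = {
    "size": 0,
    "aggs": {
        "unique_terms": {
            "cardinality": {"field": "my_field.keyword"}  # hypothetical field
        }
    },
}
resp = requests.post(ES_URL, json=card_query).json()
unique_terms = resp["aggregations"]["unique_terms"]["value"]

# Size the partitions from the number of unique terms rather than total hits.
num_partitions = max(1, math.ceil(unique_terms / PARTITION_SIZE))
```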
I was initially using a cardinality aggregation to count the unique values of the top field used at the first-level aggregation. But for some use cases this approach will not consider all docs, and I will miss buckets.
So I am using the total-results approach, to be safe. Of course there are some blank fires when some partitions have no results, but otherwise you are blessing this approach, right?
Safe in terms of not blowing up memory or safe in terms of producing accurate results?
If you're doing a multi-level agg this complicates matters because a simple cardinality agg won't give us a sense of how balanced the tree is. Some branches could have significantly more values than others so we can't assume the number of leaves we might find in any branch. That makes picking partition sizes for the different levels more complex.
Generally speaking, if you set size to return N buckets and the number of buckets returned is less than N, then you know your partition size is safe (accuracy-wise): all terms have been fully considered in that request. If the number of buckets returned is exactly N, then you should pay attention to the reported error margins; we would likely be dropping some terms as part of the computation, meaning some data may not have been considered in the reduction.
The doc_count_error_upper_bound is a value reported by the engine, not provided by you.
It tells you at most how many docs might have been ignored for returned terms. If you're interested in sums or averages of values held on these docs then of course your figures will be off by the amounts on those missing docs. By focusing analysis of each request on a suitably reduced set of terms (using partitioning) you should see the doc_count_error_upper_bound result head towards zero.
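One way to apply that check in client code might look like the following sketch, where agg_result is assumed to be the terms-aggregation object from one partitioned response (e.g. resp["aggregations"]["by_field"]):

```python
def partition_fully_considered(agg_result: dict, requested_size: int) -> bool:
    """Decide whether a partitioned terms agg response looks complete."""
    buckets = agg_result["buckets"]
    if len(buckets) < requested_size:
        # Fewer buckets than requested: every term in this partition was
        # fully considered in the request.
        return True
    # Exactly `size` buckets came back: fall back to the error margin the
    # engine reports. Zero means no docs were ignored for the returned terms.
    return agg_result.get("doc_count_error_upper_bound", 0) == 0
```

If this returns False for a partition, increasing num_partitions (so each request handles fewer terms) should drive the reported error back towards zero.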