Hi,
I am searching for a way to count the number of buckets matching a certain criterion.
Contrary to Aggregegation buckets count the cardinality aggregation does not work in my case.
An example to make it clearer:
I need the number of values (for a non-analyzed field) that exist only once (alternatively also more than once would be fine, as the number of distinct values is easy to retrieve).
Let's say, my set contains a, a, b, a, c, c, d, e, e => then I want "2" as a result (for b and d are unique)
The cardinality would return the distinct count, i.e. 5 (also including a, c and e).
I could do a terms aggregation with min_doc_count set to 2 - then I would get 3 buckets (a, c, e), and subtract this count from the cardinality. But that's far too expensive, I neither need nor want the values of these buckets.
The terms aggregation itself only contains doc_count and sum_other_doc_count, the bucket_count can be infered from the result, but there is no sum_other_bucket_count (in this case I could limit the size to 1 (as 0 means all) and have only 1 additional bucket with information that I do not need (still better than a million).
For any pipeline aggregation, I also need the aggregation with all the buckets first.
I would also be happy with a query allowing to apply such a filter - then I could use the filter aggregation to get only uniquely appearing hits and count them.
My preferred solution (also for other use cases) would be to have a Metrics Aggregation returning the number of buckets for an aggregation without returning the values for the buckets.
Hmm, this is a tricky one. The only way I can think to do it is like this:
Terms aggregation, set to a size large enough to encompass all your terms. This step could be very expensive in terms of memory, although not too bad if you aren't deeply nesting this aggregation (e.g. if it is a top-level agg). I wouldn't attempt it if the agg is deeply nested, as the branching factor will crush your heap
min_doc_count: 1 on the terms agg to ensure you get buckets that have at least one doc
bucket_selector pipeline which removes any bucket that doesn't have count == 1
sum_bucket pipeline to add up the count of the remaining buckets
filter_path to only show the final metric, instead of sending back all the buckets which match the criteria
This gets funky if fields are multi-valued. E.g. if "body": ["a","b"], the document will be placed into both buckets "a" and "b", which may not be what you're looking for.
If you have a very large long-tail of values with only one doc, all those buckets need to be serialized and streamed to the coordinating node for merging. So there's also a network and some CPU overhead here too.
I'll poke some people about if a max_doc_count might be possible on the terms aggregation.
Thank you for the quick response.
In my opinion, the main drawback is the processing of all buckets - as in "all_terms" in your example.
In our main use case, the field we are talking about is also the routing parameter - therefore there is no need to transmit values between shards; the counts are sufficient in this case, and there is no overlap between shards.
The additions that would probably help most:
a total bucket count in the response (then I can cut after the first x buckets and still not loose the total number)
min_doc_count (and possibly also a max_doc_count) on the cardinality aggregation - though that might be hard to implement
cardinality can not only get a field but alternatively an aggregation builder - and returns the count of the bucket values instead of doc values in this case
a countBucket aggregation similar to the filter aggregation; but taking a bucket aggregation as parameter instead of a queryBuilder
*...
Something like that would be cool.
Multi-value fields are no problem in our case; they are non-analyzed.
The long-tail is a serious concern, there are really a lot of values. Unfortunatlely not such a high percentage that counting buckets of size >=2 solves the problem.
Of course we could also use native scripts and implement the behavior there - but the processing time is too slow even without actually processing the values. Within the internal ES data structures there should be faster ways to determine such things...
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.