I've calculated unique counts for a dataset, but I'd like to only display those counts when they are above a certain number. On regular fields I would use {"min_doc_count": 13} in the JSON input, but on unique counts I get an error. Is there a way to restrict what is returned to counts above that threshold?
Also, if you know how to include documents with missing values in the unique count, I'd appreciate it.
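For context, this is roughly what I'm trying (index and field names here are just placeholders): the terms aggregation accepts min_doc_count fine, but adding the same key to the unique count (cardinality) aggregation is what produces the error for me.

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_host": {
      "terms": {
        "field": "host.keyword",
        "min_doc_count": 13
      }
    },
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "min_doc_count": 13
      }
    }
  }
}
```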
@karussell Sorry for the delay, meant to reply to this sooner!
So I haven't looked at the script too closely, but a concern with this kind of cardinality aggregation is memory. E.g. collecting the counts in a simple map is 100% accurate, but it also has a very high memory burden, because each shard has to maintain a map of terms and then serialize that map to the coordinating node.
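For illustration, the naive pattern I mean looks roughly like this: a scripted_metric agg that collects every term into a per-shard set and merges the sets on the coordinator. This is just a sketch, not a recommendation (the field name is a placeholder, and it assumes a keyword field that is present in every document):

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "exact_unique_users": {
      "scripted_metric": {
        "init_script": "state.terms = new HashSet()",
        "map_script": "state.terms.add(doc['user_id'].value)",
        "combine_script": "return state.terms",
        "reduce_script": "def all = new HashSet(); for (s in states) { if (s != null) { all.addAll(s) } } return all.size()"
      }
    }
  }
}
```

Each shard ships its whole term set back to the coordinating node, which is where the memory math below comes in.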
As a toy example, consider 20 shards, each with 10m unique terms. If all of those terms are distinct across shards (which isn't unusual if you're running this against something like an IP address, a user ID, etc.), that's 200m unique terms the coordinator needs to merge. Ignoring the runtime cost of the merge itself, if each term is ~10 bytes, that's 2GB of aggregation responses the coordinator has to hold in memory while reducing.
If a couple of those requests are running in parallel, it's very easy to reach a point where the node runs out of memory.
That's why the Elasticsearch cardinality aggregation uses a HyperLogLog sketch to approximate cardinality rather than calculate the true cardinality. In exchange for roughly 1-5% error (depending on the configured precision), you can estimate the cardinality in a few hundred kilobytes of memory.
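If approximate counts are acceptable, the built-in cardinality agg is usually the better tool; its precision_threshold parameter trades memory for accuracy (counts below the threshold are close to exact, and 40000 is the maximum). The field name here is again a placeholder:

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}
```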
So that's the disclaimer, and why one should be careful with scripted-metric aggs in general. We do a lot to make sure aggs have efficient runtime costs in both time and space, but scripted-metric lets you do anything you want, and it's easy to accidentally write a foot-gun.
> As a toy example, consider 20 shards each with 10m unique terms.
I know that I have far fewer than 100k terms in total, so memory shouldn't be an issue, and a high-precision count is important for this task. I was able to get rid of the NPE when I skipped some "null" entries, like so:
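Roughly along these lines — this is only a sketch of the idea rather than my exact script, and the field name is a placeholder. The only change from the naive sketch above is the guard in the map_script, which skips documents where the field is missing instead of blowing up on .value:

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "exact_unique_users": {
      "scripted_metric": {
        "init_script": "state.terms = new HashSet()",
        "map_script": "if (doc.containsKey('user_id') && doc['user_id'].size() > 0) { state.terms.add(doc['user_id'].value) }",
        "combine_script": "return state.terms",
        "reduce_script": "def all = new HashSet(); for (s in states) { if (s != null) { all.addAll(s) } } return all.size()"
      }
    }
  }
}
```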