@karussell Sorry for the delay, meant to reply to this sooner!
So I haven't looked at the script too closely, but a concern with this kind of cardinality aggregation is memory. E.g. collecting the counts in a simple map will be 100% accurate, but also has a very high memory burden because each shard will have to maintain a map of terms and then serialize that map to the coordinator.
As a toy example, consider 20 shards each with 10M unique terms. If all those terms are unique across shards (which isn't unusual for fields like IP addresses or user IDs), that generates 200M unique terms which the coordinator needs to merge. Ignoring the runtime cost of merging, if each term is ~10 bytes, that's 2 GB of aggregation responses the coordinator has to hold in memory while reducing.
If a couple of those requests run in parallel, it's very easy to reach the point where the node runs out of memory.
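To make the arithmetic concrete, here's a quick back-of-the-envelope sketch (plain Python, not Elasticsearch code) using the hypothetical numbers above:

```python
# Toy illustration of the coordinator-side cost of an exact, map-based
# cardinality agg. All numbers are the hypothetical ones from this comment.

num_shards = 20
terms_per_shard = 10_000_000   # 10M unique terms per shard
bytes_per_term = 10            # rough average serialized size of one term

# Worst case: no overlap between shards (common for IPs, user IDs, ...)
total_terms = num_shards * terms_per_shard   # terms the coordinator merges
total_bytes = total_terms * bytes_per_term   # bytes held while reducing

print(f"{total_terms:,} terms ~= {total_bytes / 1e9:.1f} GB on the coordinator")
# → 200,000,000 terms ~= 2.0 GB on the coordinator
```

And that's per request, before accounting for the transient per-shard maps and serialization overhead.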
That's why the Elasticsearch cardinality aggregator uses a HyperLogLog sketch to approximate cardinality rather than calculate the true cardinality. In exchange for 1-5% error (depending on the configured precision and the actual cardinality), you can estimate cardinality in a few hundred kilobytes.
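For intuition, here's a minimal HyperLogLog in Python. This is an illustration only, not the actual Elasticsearch implementation (which uses HyperLogLog++ with additional bias corrections); it just shows how a fixed-size register array can estimate cardinality regardless of input size:

```python
# Minimal HyperLogLog sketch: ~16 KB of registers no matter how many
# values are added. Illustrative only, not the Elasticsearch internals.
import hashlib
import math

P = 14         # precision: 2^14 = 16384 one-byte registers (~16 KB total)
M = 1 << P

def _hash64(value: str) -> int:
    # Derive a 64-bit hash from SHA-1 (any well-mixed 64-bit hash works)
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

class HyperLogLog:
    def __init__(self):
        self.registers = bytearray(M)   # fixed size, independent of input

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - P)                   # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)      # remaining 64-P bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - P) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / M)      # bias correction for large M
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * M:                    # small-range: linear counting
            zeros = self.registers.count(0)
            if zeros:
                return M * math.log(M / zeros)
        return raw

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
est = hll.estimate()
print(f"true=100000 estimated={est:.0f}")  # typically within a percent or two
```

The key trade-off is visible here: the registers stay at 16 KB whether you add a thousand values or a billion, which is exactly what a scripted-metric map-based approach can't give you.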
So that's the disclaimer, and why one should be careful with scripted-metric aggs in general. We do a lot to make sure aggs have efficient runtime costs in both time and space, but scripted-metric lets you do anything you want, so it's easy to accidentally write a foot-gun.