Display Minimum Count of Unique Count Field

I've calculated unique counts for a dataset, but I'd like to display those counts only when they are above a certain number. On regular fields I would put {"min_doc_count": 13} under the JSON input, but on unique counts I get an error. Is there a way to limit what is returned?
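For reference, on a regular field the JSON input just gets merged into the aggregation body, so the working request ends up looking roughly like this (the field name is only a placeholder):

    {
      "aggs": {
        "by_vehicle": {
          "terms": {
            "field": "vehicle_id",
            "min_doc_count": 13
          }
        }
      }
    }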

Also, if you know how to include documents with missing values in the unique count, I'd appreciate it.
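For the missing-values part, one candidate (a minimal sketch, not verified here) is the cardinality aggregation's missing parameter, which treats documents without the field as having a placeholder value:

    {
      "aggs": {
        "vehicle_count": {
          "cardinality": {
            "field": "vehicle_id",
            "missing": "N/A"
          }
        }
      }
    }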


Looking at the cardinality aggregation documentation, min_doc_count is not a valid parameter there.

Would you mind creating an enhancement request at https://github.com/elastic/elasticsearch/issues/new to add min_doc_count support to the cardinality aggregation?
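In the meantime, a possible workaround (a sketch only, assuming the unique counts sit under a terms aggregation on some group field; all names here are placeholders) is a bucket_selector pipeline aggregation, which drops buckets whose unique count is below the threshold:

    {
      "aggs": {
        "per_group": {
          "terms": { "field": "group" },
          "aggs": {
            "unique_vehicles": {
              "cardinality": { "field": "vehicle_id" }
            },
            "only_13_plus": {
              "bucket_selector": {
                "buckets_path": { "uniques": "unique_vehicles" },
                "script": "params.uniques >= 13"
              }
            }
          }
        }
      }
    }

Note that bucket_selector only works inside a parent multi-bucket aggregation, so it doesn't cover a single top-level unique count, which is why the enhancement request is still worth filing.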

Interesting that I'm looking for the same feature at the same time :slight_smile:
I've created: https://github.com/elastic/elasticsearch/issues/54649

How could I implement this with a scripted_metric? Currently I'm getting a null pointer exception:

    "vehicle_count": {
      "scripted_metric": {
        "init_script":    "state.vehicles = [:]",
        "map_script":     "if(doc['vrp.vehicle_ids'].size()==0) /*ignore missing values*/return 0; String key = doc['vehicle_id']; if(state.vehicles.containsKey(key)) { state.vehicles[key]++; } else { state.vehicles[key] = 1; }",
        "combine_script": "double vehicles=0; for (v in state.vehicles.values) { if(v.value>10) { vehicles += v.value; } } return vehicles",
        "reduce_script" : "double vehicles=0; for (v in states) { vehicles += v; } return vehicles"
      }
    }

And are scripted_metric aggregations more accurate than the cardinality aggregation?

@karussell Sorry for the delay, meant to reply to this sooner!

So I haven't looked at the script too closely, but a concern with this kind of cardinality calculation is memory. For example, collecting the counts in a simple map will be 100% accurate, but it also has a very high memory burden, because each shard has to maintain a map of terms and then serialize that map to the coordinator.

As a toy example, consider 20 shards, each with 10m unique terms. If all those terms are unique across shards (which isn't unusual when running against something like an IP address, user ID, etc.), that generates 200m unique terms which the coordinator needs to merge. Ignoring the runtime cost of merging, if each term is ~10 bytes, that's ~2 GB of aggregation responses the coordinator has to hold in memory while reducing.

If there are a couple of those requests running in parallel, it's very easy to get to a point that the node runs out of memory.

That's why the Elasticsearch cardinality aggregation uses a HyperLogLog sketch to approximate cardinality rather than calculate the true cardinality. In exchange for roughly 1-5% error (depending on the configured precision), you can estimate cardinality in a few hundred kilobytes of memory.
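The trade-off is tunable, too: the cardinality aggregation accepts a precision_threshold option, and counts below that threshold are expected to be close to exact. A minimal sketch (field name assumed):

    {
      "aggs": {
        "vehicle_count": {
          "cardinality": {
            "field": "vehicle_id",
            "precision_threshold": 40000
          }
        }
      }
    }

40000 is the maximum supported value; higher precision costs more memory per shard.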

So that's the disclaimer, and why one should be careful with scripted_metric aggs in general. We do a lot to make sure aggs have efficient runtime costs in both time and space, but scripted_metric lets you do anything you want, and it's easy to accidentally write a foot-gun :slight_smile:


Thanks for your response and no worries :slight_smile:

> As a toy example, consider 20 shards, each with 10m unique terms.

I know that I have far fewer than 100k terms in total, so memory shouldn't be an issue, and a high-precision count is important for this task. I was able to get rid of the NPE when I skipped some "null" entries like so:

"combine_script": "if(state.vehicles == null || state.vehicles.values == null) return 0; ...

but the result is still wrong:

"aggregations":{"vehicle_count":{"value":0.0}}

How can I debug this and e.g. print some intermediate values?
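Two things that might help here, offered as an unverified sketch. First, if I'm reading the Painless docs right, state.vehicles.values is the map-access shortcut for state.vehicles['values'], which is null, so the null check in combine_script returns 0 on every shard; that would explain the 0.0. The iteration needs the method call state.vehicles.values(), the loop variable is then the count itself (so the test should be v > 10 rather than v.value > 10), and the key should come from doc['vehicle_id'].value. Second, to see intermediate values, you can temporarily return the raw state from combine_script and the whole states list from reduce_script, so the per-shard maps appear verbatim in the response; Debug.explain(obj) in any script also throws an error revealing an object's actual type. Roughly:

    "vehicle_count": {
      "scripted_metric": {
        "init_script":    "state.vehicles = [:]",
        "map_script":     "/* skip docs without a vehicle_id */ if (doc['vehicle_id'].size() > 0) { String key = doc['vehicle_id'].value; state.vehicles[key] = state.vehicles.getOrDefault(key, 0) + 1 }",
        "combine_script": "return state.vehicles",
        "reduce_script":  "return states"
      }
    }

Once the per-shard maps look right, the thresholding can be restored in combine_script with the fixed values() call.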
