Query to get exact cardinality counts

I'm trying to get an exact count of documents that meet a specific criterion. Aggregation queries are somewhat helpful, but I also need to return the matching results, so I'm hoping someone knows a good way to approach this problem. Here is what some of my documents look like:

{
  "doc_id": 1,
  "doc_type": "foo"
}

{
  "doc_id": 1,
  "doc_type": "foo"
}

{
  "doc_id": 2,
  "doc_type": "foo"
}

{
  "doc_id": 2,
  "doc_type": "bar"
}

The criterion I'm searching for is documents that have the same doc_id but more than one unique value for doc_type. In the above example, doc_id = 1 would be fine and should not be picked up by my query, but doc_id = 2 is bad, and I need to capture both the doc_id itself and the fact that it counts as 1 result meeting my criterion.

Does anyone know a good method to generate this information quickly? Currently I have some Python code that generates a list of every doc_id, then searches on each one individually and collects the unique doc_type values, but that's not very quick and I have millions of documents.

Is there a better way to go about this? I know a cardinality aggregation would work, but my understanding is that its counts aren't exact, and since this spans multiple shards I'm not sure I can rely on the results. I'm hoping there is a more efficient way to solve this than my current approach.
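To make the criterion concrete, here is a minimal plain-Python sketch that applies it to the four sample documents above. It runs entirely in memory and does not touch Elasticsearch; the function name `find_conflicting_ids` is made up for illustration.

```python
from collections import defaultdict

# The four sample documents from the question.
docs = [
    {"doc_id": 1, "doc_type": "foo"},
    {"doc_id": 1, "doc_type": "foo"},
    {"doc_id": 2, "doc_type": "foo"},
    {"doc_id": 2, "doc_type": "bar"},
]


def find_conflicting_ids(documents):
    """Return {doc_id: sorted doc_types} for ids with more than one distinct doc_type."""
    types_by_id = defaultdict(set)
    for doc in documents:
        types_by_id[doc["doc_id"]].add(doc["doc_type"])
    return {
        doc_id: sorted(types)
        for doc_id, types in types_by_id.items()
        if len(types) > 1
    }


conflicts = find_conflicting_ids(docs)
print(conflicts)       # {2: ['bar', 'foo']}
print(len(conflicts))  # 1
```

The desired output is both pieces: the offending doc_id values and how many of them there are.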

Hi!

If you absolutely need 100% accuracy for the count, you might not be able to use the cardinality aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html), but in my experience its result is very close to the exact value. Of course, that depends on the number of shards and documents. If you only need to know whether there is more than one type, I'd say (though it's a guess) that you will get that information, just not necessarily the exact count. Setting precision_threshold may help you out.
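As a rough sketch of the cardinality idea, the request body could look something like the dictionary built below. The aggregation names `by_doc_id` and `distinct_types` are arbitrary, and the sketch assumes doc_id and doc_type are mapped so that they can be aggregated on (for example, doc_type as a keyword field). The documented maximum for precision_threshold is 40000. The result is still an approximation, so treat it as a hint rather than a guaranteed exact count.

```python
import json


def cardinality_per_doc_id(precision_threshold=40000, size=1000):
    """Build a search body: one bucket per doc_id, each with an
    approximate distinct count of doc_type values."""
    return {
        "size": 0,  # no hits, aggregations only
        "aggs": {
            "by_doc_id": {  # arbitrary aggregation name
                "terms": {"field": "doc_id", "size": size},
                "aggs": {
                    "distinct_types": {  # arbitrary aggregation name
                        "cardinality": {
                            "field": "doc_type",
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            }
        },
    }


body = cardinality_per_doc_id()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index holding the documents.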
Otherwise, you can try simply getting the doc_type count by using a terms aggregation and checking whether there is more than one bucket.
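The terms-aggregation check could be sketched roughly as follows. This is only an illustration: the aggregation names `by_doc_id` and `types` are arbitrary, the field mappings are assumed to support terms aggregations, and `sample_response` is a hand-written stand-in shaped like a terms aggregation response, not real cluster output.

```python
def build_terms_body(size=1000):
    """Build a search body: one bucket per doc_id, with one sub-bucket per doc_type."""
    return {
        "size": 0,  # no hits, aggregations only
        "aggs": {
            "by_doc_id": {
                "terms": {"field": "doc_id", "size": size},
                "aggs": {"types": {"terms": {"field": "doc_type"}}},
            }
        },
    }


def ids_with_multiple_types(response):
    """Return {doc_id: [doc_type, ...]} for doc_id buckets holding more than one doc_type bucket."""
    flagged = {}
    for bucket in response["aggregations"]["by_doc_id"]["buckets"]:
        types = [t["key"] for t in bucket["types"]["buckets"]]
        if len(types) > 1:
            flagged[bucket["key"]] = types
    return flagged


# Hand-written response for the four sample documents (assumed shape).
sample_response = {
    "aggregations": {
        "by_doc_id": {
            "buckets": [
                {"key": 1, "doc_count": 2,
                 "types": {"buckets": [{"key": "foo", "doc_count": 2}]}},
                {"key": 2, "doc_count": 2,
                 "types": {"buckets": [{"key": "foo", "doc_count": 1},
                                       {"key": "bar", "doc_count": 1}]}},
            ]
        }
    }
}

flagged = ids_with_multiple_types(sample_response)
print(flagged)       # {2: ['foo', 'bar']}
print(len(flagged))  # 1
```

Note that a terms aggregation only returns the top `size` doc_id buckets, so with millions of distinct doc_id values the buckets would have to be fetched in pages, for example with a composite aggregation and its `after_key`.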
