Query DSL count distinct

Hi,

I am looking for precise count results for index with over 70 millions documents.
Equal to this: SQL: select count(distinct column) from table;

I already tried Terms-aggregation, composite-aggregation, cardinality-aggregation. But didn't get exact count.

What is the best approach for it?

If you want to count the number of documents, try:

GET /your-index/_search?track_total_hits=true&size=0

I want to count distinct value of field.
something like: select count(distinct name) from table;

So terms aggregation is the way to go.

The index with over 70 millions documents. I already tried Terms-aggregation, composite-aggregation, cardinality-aggregation. But didn't get exact count.

Could you share the request and the response please? With a terms agg.

GET abc/_search
{
"aggs":{
        "unique_fields": {
            "terms": {
                "field": "source",
                "size": 10000,
                "order": {
                  "_key": "asc"
                }
            }
        }
    },
    "size": 0
}

In order to compare I ran count on each "source" field and saw that the values are not equal.

Could you share the output?
I'd like to see the first part (with some first terms)

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "unique_fields" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "FIFO",
          "doc_count" : 13
        },
        {
          "key" : "FW",
          "doc_count" : 77757759
        },
        {
          "key" : "HW",
          "doc_count" : 12394396
        },
        {
          "key" : "Info",
          "doc_count" : 6705
        }
      ]
    }
  }
}

The numbers you are seeing are the exact numbers. What makes you think that they are not?

terms agg, you can use shard_size

cardinality agg, you can use precision_threshold
image

When running count on specific field I get different number.

For Example:
When running the following I get count 6561, while with terms aggs I got 6705.

GET baseelements_rwr/_count
{
  "query": {
    "bool": {
        "must": [
            {"match": {"source": "Info"}}
        ]
    }
  }
}

Cardinality aggregation, like some other aggregation types, are by design approximations. This is generally highligted in the documentation. I recommend reading the explanation and discussion provided in this thread as well as this one.

Could you try with:

GET baseelements_rwr/_count
{
  "query": {
    "bool": {
        "must": [
            {"term": {"source": "Info"}}
        ]
    }
  }
}

What is the mapping for this source field?
What does a typical document look like?

source field is keyword.

Same results when switching to term instead match.

Could you share a typical document please?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.