Is the precision of the cardinality aggregation determined by the total unique value count or the filtered unique value count?

Hi,

We have an index with a field containing ~30,000 unique values.
When doing a filtered cardinality aggregation on this field, which should return ~650 unique values, we experience non-deterministic results (±25).

We are using a precision_threshold of 10,000. With this configuration, should counts be close to accurate as long as the filtered result contains fewer than 10,000 unique values, or does this limit apply to the total number of unique values in the field?

Thank you in advance!
Best,
Henrik


Hi @henrhoi,

Two questions:

  1. Could you provide which version of Elasticsearch you're using?
  2. Could you provide an example query which you are running?

As noted here, there is no "guarantee" of accuracy, and the bullet point under precision control subtly mentions:

The precision_threshold options allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate

That said, seeing your query might provide more insight into potential optimizations. You can also look at Accurate Distinct Count and Values from Elasticsearch. | by Pratik Patil | Medium, which provides some good examples of getting "accurate" counts at the cost of speed.
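
One common way to trade speed for exactness, in the spirit of that article, is to enumerate the distinct values instead of estimating them, for example with a terms aggregation whose size is set well above the expected number of distinct values. A minimal, untested sketch (your_field, your_filter_field, and the size of 5000 are placeholders, not taken from your setup):

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "your_filter_field": "some value"
                    }
                }
            ]
        }
    },
    "aggs": {
        "distinct_values": {
            "terms": {
                "field": "your_field",
                "size": 5000
            }
        }
    }
}

Counting the returned buckets client-side then gives an exact distinct count, provided size is larger than the total number of distinct matching values (shard_size defaults to size * 1.5 + 10). For very high cardinalities, a composite aggregation that pages through the values is the safer option.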

Thank you for the quick response. See my answers below.

  1. We are using version 7.16.2.
  2. An example query is provided below:

For context, this query is used in a pivot table without any splits; the Total script in the terms aggregation is only there to place the value under a single "Total" bucket.

{
    "aggs": {
        "column_1": {
            "aggs": {
                "value_524933379": {
                    "cardinality": {
                        "field": "some_field",
                        "precision_threshold": 10000
                    }
                }
            },
            "terms": {
                "missing": -1,
                "order": {
                    "_key": "asc"
                },
                "script": {
                    "lang": "painless",
                    "source": "(('Total').toString())"
                },
                "shard_size": 200,
                "size": 10
            }
        }
    },
    "query": {
        "filter": [],
        "must": [
            {
                "terms": {
                    "another_field": [
                        "SOME VALUE"
                    ]
                }
            }
        ]
    },
    "size": 0,
    "track_total_hits": false
}

Thank you.

Overall, the query looks relatively good. Two possible suggestions (I'm not 100% confident they'll improve the results, though):

  1. For your query, try using filter instead of must, since you don't appear to need the scoring that must provides.
  2. Try the execution_hint option set to direct (or experiment with the other possible values) to see whether you get better accuracy. A sketch combining both changes is shown below.
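
A rough, untested sketch of your request with both suggestions applied. Note that execution_hint on the cardinality aggregation is an assumption on my part and may not be supported on 7.16.2; if the request is rejected, simply drop that line:

{
    "size": 0,
    "track_total_hits": false,
    "query": {
        "bool": {
            "filter": [
                {
                    "terms": {
                        "another_field": [
                            "SOME VALUE"
                        ]
                    }
                }
            ]
        }
    },
    "aggs": {
        "column_1": {
            "terms": {
                "missing": -1,
                "order": {
                    "_key": "asc"
                },
                "script": {
                    "lang": "painless",
                    "source": "(('Total').toString())"
                },
                "shard_size": 200,
                "size": 10
            },
            "aggs": {
                "value_524933379": {
                    "cardinality": {
                        "field": "some_field",
                        "precision_threshold": 10000,
                        "execution_hint": "direct"
                    }
                }
            }
        }
    }
}

Moving the terms clause from must into bool.filter keeps the same matching behaviour, but skips scoring and allows the clause to be cached.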

Thanks, Ben!

I'll try those suggestions.

In general, what behaviour or accuracy should we expect when aggregating on full vs filtered data? Should we expect accurate results when the data is filtered to <1000 unique values?

Best,
Henrik
