Inconsistent counts observed for terms aggregation

Describe the issue: Inconsistent counts observed for a terms aggregation.
We are applying the aggregation on a keyword-type field (non-nested).

Our mapping:

{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "os": {
                "properties": {
                    "name": {
                        "type": "text",
                        "fields": {
                            "raw": {
                                "type": "keyword"
                            }
                        },
                        "copy_to": [
                            "fullText"
                        ]
                    }
                }
            }
        }
    }
}

Our query:

{
    "size": 0,
    "aggregations": {
        "os.name.raw-TERMS": {
            "terms": {
                "field": "os.name.raw",
                "size": 10,
                //  "shard_size" -> maybe 15 to 20
                "min_doc_count": 1,
                "shard_min_doc_count": 0,
                "show_term_doc_count_error": false,
                "order": [
                    {
                        "_count": "desc"
                    },
                    {
                        "_key": "asc"
                    }
                ]
            }
        }
    }
}

Response:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 15,
        "successful": 15,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "os.name.raw-TERMS": {
            "doc_count_error_upper_bound": 72,
            "sum_other_doc_count": 2631,
            "buckets": [
                {
                    "key": "Unidentified",
                    "doc_count": 25349
                },
                ... 6 buckets omitted ...
                {
                    "key": "Microsoft Windows",
                    "doc_count": 74
                },
                ... 2 buckets omitted ...
            ]
        }
    }
}

As you can see in the aggregation buckets, the "Microsoft Windows" doc_count is 74. If we increase shard_size to >= 1000, we get an accurate result, because doc_count_error_upper_bound becomes 0.
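For context, raising shard_size is set directly inside the terms aggregation; a sketch with an illustrative value (1000 here is just an example, not a recommendation):

```json
{
    "size": 0,
    "aggregations": {
        "os.name.raw-TERMS": {
            "terms": {
                "field": "os.name.raw",
                "size": 10,
                "shard_size": 1000,
                "order": [
                    { "_count": "desc" },
                    { "_key": "asc" }
                ]
            }
        }
    }
}
```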

I checked the internals in the documentation: shard_size is the number of buckets each shard returns before the results are merged on the coordinating node, and doc_count_error_upper_bound: 0 means the results are accurate.
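A minimal Python sketch of why this truncation undercounts (the per-shard term counts below are made up for illustration): a term that is common overall but ranks below the shard_size cutoff on every individual shard never reaches the coordinating node.

```python
from collections import Counter

# Hypothetical term counts on two shards: "windows" is common
# overall but ranks below the top 3 on each individual shard.
shard1 = Counter({"unidentified": 100, "linux": 50, "macos": 40, "windows": 30})
shard2 = Counter({"unidentified": 90, "android": 60, "ios": 45, "windows": 35})

def merge(shards, shard_size):
    """Each shard returns only its top `shard_size` terms; the
    coordinating node sums whatever it received."""
    total = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            total[term] += count
    return total

# With shard_size=3, "windows" is dropped on both shards.
truncated = merge([shard1, shard2], shard_size=3)
# With shard_size >= the number of distinct terms, the merge is exact.
exact = merge([shard1, shard2], shard_size=10)

print(truncated.get("windows", 0))  # 0  -> undercounted
print(exact["windows"])             # 65 -> accurate
```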

ISSUE
Here is the problem.

I checked the uniqueness count for os.name.raw with a cardinality aggregation and it is 478, so any shard_size above this makes doc_count_error_upper_bound 0.
But this is not limited to a single index: we need the top N results across multiple indices, and we cannot check the uniqueness of all of them before querying.
If we put a higher cap on shard_size, it is still possible that the uniqueness will grow and cross that value in the future. So whatever value we pick, there is no guarantee that it will always produce accurate results.
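For reference, the cardinality check mentioned above looks like this. Note that the cardinality aggregation is itself approximate (HyperLogLog++ based), so it is a guide rather than a guarantee:

```json
{
    "size": 0,
    "aggregations": {
        "os.name.raw-CARDINALITY": {
            "cardinality": {
                "field": "os.name.raw"
            }
        }
    }
}
```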

We are sure that data uniqueness will continue to increase in the index over time.
This means we will have to keep adjusting shard_size repeatedly to maintain accuracy.
We want to avoid manually adjusting this value or using very high values because it causes heavy computation and performance overhead.

NEED
I want to know if there is any other aggregation or approach to get the top N buckets with accurate counts.
We want buckets sorted by count in descending order, with keys in ascending order as a tiebreaker, i.e. results in this order:
{
    "_count": "desc"
},
{
    "_key": "asc"
}

# Already checked the composite aggregation; sorting by _count is not supported there.

Please advise how to resolve this issue, or whether there is a recommended approach for this scenario.

Thanks!

Which version of Elasticsearch are you using?

What you are describing is a known tradeoff when trying to achieve accurate aggregations across large datasets in a distributed environment. There is no magic aggregation or solution that gives you what you need, as all keys must be passed back to the coordinating node for the aggregation to be accurate. If you want accuracy you do need to increase the number of buckets, which adds overhead and latency and, as far as I recall, is also capped at some value.

The only alternative approach I can think of is to create a transform that feeds a separate entity-centric index keeping track of the counts. This could work per term, and possibly also per time period, and would be faster and more efficient to query. It avoids missing terms that are rare on specific shards, but will likely still be somewhat inaccurate, as it only runs periodically and may miss some of the latest data.
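If transforms are available on your cluster (the _transform API exists from 7.5; earlier 7.x releases used _data_frame/transforms), a rough sketch of such an entity-centric setup might look like this. The transform id, index names, and the @timestamp field are made up for illustration:

```json
PUT _transform/os-name-counts
{
    "source": { "index": "my-logs-*" },
    "dest": { "index": "os-name-counts" },
    "pivot": {
        "group_by": {
            "os_name": { "terms": { "field": "os.name.raw" } }
        },
        "aggregations": {
            "doc_count": { "value_count": { "field": "os.name.raw" } }
        }
    },
    "sync": {
        "time": { "field": "@timestamp", "delay": "60s" }
    }
}
```

The destination index then has one document per term, so a simple sorted query against it gives exact top-N counts as of the last checkpoint.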


We are using version 7.x, but I think this is the case for all versions of Elasticsearch.
ref → Terms aggregation | Reference
It is mentioned in the documentation.

I asked about the version used as I was thinking about transforms.

For a more detailed discussion I would recommend Mark Harwoods response in this thread.

I believe you will need to make some kind of tradeoff, and transforms gives you one more option.

If you upgrade to a recent version of Elasticsearch you can use ES|QL which I believe is accurate by default.


Could you perhaps show how this is done in ES|QL, and describe how, behind the scenes, it gets around the limitation that applies to the aggregation when aggregating a high-cardinality field across a large set of shards? Does this mean that it is slower and more computationally expensive than the example aggregation?

I am not an expert in ES|QL but I will try my best. The provided query will translate into something like:

POST /_query?format=txt
{
  "query": "FROM <indices> | STATS total_count = count() BY os.name.raw | SORT total_count DESC, os.name.raw ASC| LIMIT 10"
}

In the query DSL, we collect accurate results per shard, but we only send the top n elements (n being the shard size) to the coordinating node for reduction. In ES|QL we send all the results to the coordinator, therefore it is always accurate.

The reason we can do that in ES|QL is because we are using a more efficient and dense representation of the aggregation results so we can handle many more results and it is memory accounted all the way so if there is too much data we should fail gracefully.


Thanks @Ignacio_Vera , that is very useful information and good to have documented somewhere as it is a reasonably common question.