Error in per bucket doc_count_error_upper_bound for Term Aggregation?

Nishikant_Tayade · January 13, 2022, 9:15am

My Elasticsearch configuration:

1 Cluster
1 Node
1 Index
3 primary shards (1 replica shard for each primary, but in UNASSIGNED state because there is only 1 node).

I have indexed documents, and they are spread across the 3 shards (Shard-0, Shard-1, Shard-2).

The terms aggregation I am trying:

POST myIndex/_search
{
  "query": {"match_all": {}}, 
  "size":0,
  "aggs": {
    "products": {
      "terms": {
        "field": "BillToID",
        "size": 10,
        "shard_size": 11,
        "show_term_doc_count_error": true
      }
    }
  }
}

Response:

"aggregations" : {
    "products" : {
      "doc_count_error_upper_bound" : 7,
      "sum_other_doc_count" : 12,
      "buckets" : [
        {
          "key" : "ProductA",
          "doc_count" : 100,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductC",
          "doc_count" : 54,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductZ",
          "doc_count" : 52,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductG",
          "doc_count" : 47,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductH",
          "doc_count" : 44,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductB",
          "doc_count" : 43,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductE",
          "doc_count" : 31,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductF",
          "doc_count" : 19,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductI",
          "doc_count" : 11,
          "doc_count_error_upper_bound" : 6
        },
        {
          "key" : "ProductJ",
          "doc_count" : 9,
          "doc_count_error_upper_bound" : 6
        }
      ]
    }
  }

From the definition in the docs, the per-bucket doc_count_error_upper_bound is:

This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

Problem: when I checked, I could see that ProductA was returned by every shard, so why does it show "doc_count_error_upper_bound" : 6 for ProductA?

Any help is much appreciated:)

Mark_Harwood · January 13, 2022, 9:40am

Hi Nishikant,

Can you describe how you verified what was returned from each shard?
Just want to check your debugging approach is valid.

"shard_size": 11,

I'm guessing this was set to this value just for this debugging exercise? Ordinarily the default would be higher.

Nishikant_Tayade · January 13, 2022, 10:03am

Hi @Mark_Harwood,
Thanks for replying. Sure!

Can you describe how you verified what was returned from each shard?

In total, I indexed 422 documents into my index.
Then, to check how many documents for a particular term (e.g. ProductA) a shard holds, I used the query below:

POST /myIndex/_search?preference=_shards:0
{
  "query": {
    "match": {
      "BillToID": "ProductA"
    }
  }
}

If totalHits > 0, shard-0 holds documents for ProductA, and totalHits gives the number of documents it holds.
I did this for every unique term (ProductA, ProductB, and so on; I know all of them beforehand) on every shard.

For example, the ProductA counts are: Shard-0 (35), Shard-1 (33), Shard-2 (32).
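The per-shard check above can be generalized so that each shard is asked for its own top terms in one request instead of one query per term. The following is a minimal Python sketch that only builds the request paths and bodies and sends nothing. The helper name `per_shard_terms_requests` is hypothetical, and it assumes that a single-shard search (via the same `preference=_shards:N` parameter used above) with `size` set to the original `shard_size` shows roughly the candidates that shard would contribute.

```python
import json


def per_shard_terms_requests(index, field, shard_size, num_shards):
    """Build one (path, body) pair per shard; nothing is sent over the network."""
    built = []
    for shard in range(num_shards):
        # Same preference syntax as the per-term check in this post.
        path = f"/{index}/_search?preference=_shards:{shard}"
        body = {
            "size": 0,
            "aggs": {
                "products": {
                    # Assumption: size == original shard_size, so the shard's
                    # whole candidate list is visible in the response.
                    "terms": {"field": field, "size": shard_size}
                }
            },
        }
        built.append((path, body))
    return built


for path, body in per_shard_terms_requests("myIndex", "BillToID", 11, 3):
    print("POST", path)
    print(json.dumps(body, indent=2))
```

Comparing the three responses side by side would show directly whether a given term is missing from any shard's list, and what the last (smallest) returned count on each shard is.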

Now, as per the docs:

Each shard prepares a priority queue of size shard_size, with terms in descending order of document count.

ProductA has the highest document count on each shard, so it must have been included in the priority queue of every shard. When the coordinating node prepared the final global sorted list, ProductA was at the top because it has the highest count.

Now, the per-bucket doc_count_error_upper_bound is defined as:

This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

But according to the checks above, ProductA was returned by all shards, so how can
doc_count_error_upper_bound = 6 be showing for the ProductA bucket?
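To make the expectation concrete, here is a minimal Python sketch of a literal reading of the quoted definition. It is not Elasticsearch's actual implementation, and the function names are hypothetical. Only the ProductA per-shard counts (35, 33, 32) come from this thread; every other per-shard count is made up for illustration.

```python
from collections import Counter


def shard_top_terms(term_counts, shard_size):
    """Return a shard's top `shard_size` (term, count) pairs, highest count first."""
    ordered = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:shard_size]


def merge_with_error_bounds(shards, size, shard_size):
    """Merge per-shard top terms and attach a per-bucket error bound.

    Literal reading of the quoted definition: for each bucket, sum the count
    of the last term returned by every shard that did not return that term.
    """
    responses = [shard_top_terms(s, shard_size) for s in shards]

    totals = Counter()
    for resp in responses:
        for term, count in resp:
            totals[term] += count

    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    buckets = []
    for term, doc_count in top:
        error = sum(
            resp[-1][1]
            for resp in responses
            if resp and term not in {t for t, _ in resp}
        )
        buckets.append(
            {"key": term, "doc_count": doc_count, "doc_count_error_upper_bound": error}
        )
    return buckets


# ProductA counts are from the thread; all other counts are hypothetical.
SHARDS = [
    {"ProductA": 35, "ProductC": 20, "ProductZ": 15, "ProductG": 5},
    {"ProductA": 33, "ProductC": 18, "ProductG": 17, "ProductZ": 4},
    {"ProductA": 32, "ProductZ": 19, "ProductC": 16, "ProductG": 6},
]

buckets = merge_with_error_bounds(SHARDS, size=3, shard_size=3)
for bucket in buckets:
    print(bucket)
```

Under this reading, ProductA is returned by all three shards and gets an error bound of 0, while the hypothetical ProductZ is missing from the second shard's list and picks up that shard's last returned count (17). That is why a value of 6 on the ProductA bucket looks inconsistent with the definition.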

I'm guessing this was set to this value just for this debugging exercise? Ordinarily the default would be higher.

Yes

Mark_Harwood · January 13, 2022, 10:45am

Thanks for the thorough response. That does look like a valid way of checking the underlying stats and the results don't seem to tally with the description:

This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

I notice this description is from the 6.8 docs but has changed in the 7.x docs. Let me do some digging. What version are you running?

Nishikant_Tayade · January 13, 2022, 10:49am

@Mark_Harwood

I notice this description is from the 6.8 docs but has changed in the 7.x docs. Let me do some digging. What version are you running?

I am running 7.10.2. I checked the description in the docs for that version, and it states the same:

This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

Let me do some digging

Thanks! Please let me know if you need any additional data from me.

Mark_Harwood · January 13, 2022, 10:56am

If we end up needing to reproduce this, we will need a minimal reproduction script that includes data. I don't think we're there yet, so I'll wait for a response from the aggs team first.
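As a starting point for such a script, the sketch below builds a `_bulk` request body (newline-delimited JSON) from the top-10 doc_count values in the response posted above. It is only a sketch under stated assumptions: the index name `repro-index` and the helper `build_bulk_payload` are hypothetical, the 12 documents counted in sum_other_doc_count are left out because their terms are not given in this thread, and the documents will not necessarily land on the same shards as in the original index, since shard placement depends on document routing.

```python
import json

# doc_count values copied from the top-10 buckets in the first post.
TERM_COUNTS = {
    "ProductA": 100, "ProductC": 54, "ProductZ": 52, "ProductG": 47,
    "ProductH": 44, "ProductB": 43, "ProductE": 31, "ProductF": 19,
    "ProductI": 11, "ProductJ": 9,
}


def build_bulk_payload(index, field, term_counts):
    """Return an NDJSON body for the _bulk API: one action line plus one doc line per document."""
    lines = []
    for term, count in term_counts.items():
        for _ in range(count):
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps({field: term}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# "repro-index" is a placeholder name, not the index from the thread.
payload = build_bulk_payload("repro-index", "BillToID", TERM_COUNTS)
print(payload.count("\n") // 2, "documents")
```

The resulting string could be sent as the body of a `POST /_bulk` request, after creating the index with 3 primary shards and a mapping for BillToID that supports terms aggregations.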

Nishikant_Tayade · January 17, 2022, 4:53am

Hi @Mark_Harwood ,

Can you please provide any updates on the above issue?

Mark_Harwood · January 17, 2022, 7:07pm

I’ve not heard anything back from the aggs team so I recommend opening a GitHub bug issue on the Elasticsearch repo with a reproducible example.

system · February 14, 2022, 7:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.