Accuracy of date histogram sub-aggregation doc count under terms aggregation

Hello,

I am working on a query that combines a terms aggregation with a date histogram sub-aggregation. I would like to get the doc count of each sub-aggregation bucket, determine whether it is accurate, and, if it is not, find the upper bound on the error.

The terms aggregation documentation mentions an accuracy issue to be aware of:

Even with a larger shard_size value, doc_count values for a terms aggregation may be approximate. As a result, any sub-aggregations on the terms aggregation may also be approximate.

The doc_count_error_upper_bound field returned by a terms aggregation gives me this information for the terms buckets. However, it is not returned by a date histogram aggregation (either as a stand-alone aggregation or as a sub-aggregation).

For example, given this query:

GET /widgets_*/_search
{
  "aggs": {
    "term_groupings": {
      "terms": {
        "field": "options.color",
        "size": 5,
        "show_term_doc_count_error": true
      },
      "aggs": {
        "date_grouping": {
          "date_histogram": {
            "calendar_interval": "month",
            "min_doc_count": 1,
            "field": "created_at",
            "format": "strict_date_time"
          }
        }
      }
    }
  },
  "size": 0,
  "track_total_hits": false,
  "_source": false
}

...I get a response like this:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "term_groupings": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "RED",
          "doc_count": 39,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 6
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 33
              }
            ]
          }
        },
        {
          "key": "BLUE",
          "doc_count": 36,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 8
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        },
        {
          "key": "GREEN",
          "doc_count": 35,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 7
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        }
      ]
    }
  }
}

I get doc_count_error_upper_bound on the terms aggregation buckets, but not on the date
histogram buckets. I understand that a date histogram aggregation on its own does not suffer
from the accuracy issue that a terms aggregation does. However, as a sub-aggregation
of a terms aggregation, could it suffer from that issue? (After all, the terms aggregation
documentation mentions that sub-aggregations can be affected.) If so, how do I determine
whether the date histogram bucket doc counts are accurate, and what the error upper bound
might be?
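
To make the question concrete, here is a minimal Python sketch of the reading I have in mind. It assumes, without confirmation from the documentation, that any documents a terms bucket is missing are also missing from its date histogram buckets, so the parent bucket's doc_count_error_upper_bound would bound the undercount of each date histogram bucket beneath it. The function name and the assumed_max_undercount field are hypothetical; the aggregation names come from the query above.

```python
# Sketch only: assumes a response shaped like the one in this post.
# "term_groupings" and "date_grouping" come from the query above; the function
# name and the "assumed_max_undercount" field are made up for illustration.

def sub_bucket_error_report(response, terms_name="term_groupings", sub_name="date_grouping"):
    """Pair every date histogram bucket with its parent term bucket's error bound."""
    rows = []
    for term_bucket in response["aggregations"][terms_name]["buckets"]:
        # Only present per bucket when show_term_doc_count_error is true.
        parent_bound = term_bucket.get("doc_count_error_upper_bound")
        sub_buckets = term_bucket[sub_name]["buckets"]
        # If every document has the date field, the sub-bucket counts
        # should add up to the parent bucket's doc_count.
        sub_total = sum(b["doc_count"] for b in sub_buckets)
        for b in sub_buckets:
            rows.append({
                "term": term_bucket["key"],
                "date": b["key_as_string"],
                "doc_count": b["doc_count"],
                # ASSUMPTION: the parent's bound carries over to each sub-bucket.
                "assumed_max_undercount": parent_bound,
                "sums_to_parent": sub_total == term_bucket["doc_count"],
            })
    return rows


# Trimmed copy of the response above (RED bucket only).
sample = {
    "aggregations": {
        "term_groupings": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "RED",
                    "doc_count": 39,
                    "doc_count_error_upper_bound": 0,
                    "date_grouping": {
                        "buckets": [
                            {"key_as_string": "2023-10-01T00:00:00.000Z",
                             "key": 1696118400000, "doc_count": 6},
                            {"key_as_string": "2023-11-01T00:00:00.000Z",
                             "key": 1698796800000, "doc_count": 33},
                        ]
                    },
                }
            ],
        }
    }
}

for row in sub_bucket_error_report(sample):
    print(row)
```

For the sample response, every parent bound is 0 and the date histogram counts add up to the parent doc_count, so under this assumption the sub-bucket counts would be exact. Whether the assumption holds in general is exactly what I am unsure about.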

Thanks!
