Accuracy of date histogram sub-aggregation doc count under terms aggregation

myronmarston · December 5, 2023, 10:26pm

Hello,

I am working on a query that combines a terms aggregation with a date histogram sub-aggregation. I would like to get the doc count of each sub-aggregation bucket, determine whether it is accurate, and, if it is not, find the upper bound on the error.

The terms aggregation documentation mentions an accuracy issue to be aware of:

Even with a larger shard_size value, doc_count values for a terms aggregation may be approximate. As a result, any sub-aggregations on the terms aggregation may also be approximate.

The doc_count_error_upper_bound field returned by a terms aggregation gives me the information I'm looking for. However, it is not returned on a date histogram aggregation (either as a stand-alone aggregation, or when used as a sub-aggregation).

For example, given this query:

GET /widgets_*/_search
{
  "aggs": {
    "term_groupings": {
      "terms": {
        "field": "options.color",
        "size": 5,
        "show_term_doc_count_error": true
      },
      "aggs": {
        "date_grouping": {
          "date_histogram": {
            "calendar_interval": "month",
            "min_doc_count": 1,
            "field": "created_at",
            "format": "strict_date_time"
          }
        }
      }
    }
  },
  "size": 0,
  "track_total_hits": false,
  "_source": false
}

...I get a response like this:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "term_groupings": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "RED",
          "doc_count": 39,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 6
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 33
              }
            ]
          }
        },
        {
          "key": "BLUE",
          "doc_count": 36,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 8
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        },
        {
          "key": "GREEN",
          "doc_count": 35,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 7
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        }
      ]
    }
  }
}
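As a rough sanity check on a response like the one above, the buckets can be walked in Python to put each term's reported doc_count_error_upper_bound next to the sum of its date histogram sub-bucket counts. This is only a sketch: the helper name summarize_terms_buckets is hypothetical, the response is assumed to be already parsed into a dict, and the sum only matches the parent doc_count when every matched document has exactly one created_at value. It does not by itself establish an error bound for the date histogram buckets.

```python
from typing import Any


def summarize_terms_buckets(
    response: dict[str, Any],
    terms_agg: str = "term_groupings",
    sub_agg: str = "date_grouping",
) -> list[dict[str, Any]]:
    """For each terms bucket, report its doc_count, its reported error
    bound, and the sum of the doc_counts of its sub-aggregation buckets."""
    rows = []
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        sub_total = sum(b["doc_count"] for b in bucket[sub_agg]["buckets"])
        rows.append(
            {
                "key": bucket["key"],
                "doc_count": bucket["doc_count"],
                # Only present when show_term_doc_count_error is true.
                "error_upper_bound": bucket.get("doc_count_error_upper_bound"),
                "sub_bucket_total": sub_total,
            }
        )
    return rows


# Trimmed copy of the response shown above (only the fields used here).
response = {
    "aggregations": {
        "term_groupings": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "RED",
                    "doc_count": 39,
                    "doc_count_error_upper_bound": 0,
                    "date_grouping": {
                        "buckets": [
                            {"key": 1696118400000, "doc_count": 6},
                            {"key": 1698796800000, "doc_count": 33},
                        ]
                    },
                },
                {
                    "key": "BLUE",
                    "doc_count": 36,
                    "doc_count_error_upper_bound": 0,
                    "date_grouping": {
                        "buckets": [
                            {"key": 1696118400000, "doc_count": 8},
                            {"key": 1698796800000, "doc_count": 28},
                        ]
                    },
                },
                {
                    "key": "GREEN",
                    "doc_count": 35,
                    "doc_count_error_upper_bound": 0,
                    "date_grouping": {
                        "buckets": [
                            {"key": 1696118400000, "doc_count": 7},
                            {"key": 1698796800000, "doc_count": 28},
                        ]
                    },
                },
            ],
        }
    }
}

for row in summarize_terms_buckets(response):
    print(row)
```

For the response above, the sub-bucket totals come out to 39, 36, and 35, which match each parent doc_count, and every reported error bound is 0.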

I get doc_count_error_upper_bound on the terms aggregation buckets, but not on the date histogram buckets. I understand that a date histogram aggregation on its own does not suffer from the same accuracy issue that a terms aggregation does. However, as a sub-aggregation of a terms aggregation, could it suffer from that issue? (After all, the terms aggregation documentation mentions that sub-aggregations can be affected.) If so, how do I determine whether the date histogram bucket doc counts are accurate, and what the error upper bound might be?

Thanks!

system · January 2, 2024, 10:27pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
