Hello,
I am working on a query that combines a terms aggregation with a date histogram sub-aggregation. I would like to get the doc count of each sub-aggregation bucket, determine whether it is accurate, and, if it is not, find the upper bound on the error.
The terms aggregation documentation mentions an accuracy issue to be aware of:
"Even with a larger shard_size value, doc_count values for a terms aggregation may be approximate. As a result, any sub-aggregations on the terms aggregation may also be approximate."
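To spell out my understanding of where the approximation comes from, here is a toy Python model of the merge step. It is not Elasticsearch code: the function name `merge_shard_terms`, the shard contents, and the rule used for the bound (for every shard that did not return a term, add the smallest count that shard did return) are my assumptions based on reading the documentation.

```python
from collections import Counter

def merge_shard_terms(shards, shard_size, size):
    """Toy model of merging per-shard top terms (hypothetical, not ES code)."""
    per_shard = []
    for counts in shards:
        # Each shard reports only its top `shard_size` terms.
        top = Counter(counts).most_common(shard_size)
        truncated = len(counts) > shard_size
        # A term the shard left out can have at most as many docs as the
        # smallest count it did report (0 if nothing was left out).
        cutoff = top[-1][1] if (truncated and top) else 0
        per_shard.append((dict(top), cutoff))

    merged = Counter()
    for returned, _ in per_shard:
        merged.update(returned)

    result = []
    for term, doc_count in merged.most_common(size):
        # Assumed rule: sum the cutoffs of the shards that omitted this term.
        error = sum(cutoff for returned, cutoff in per_shard if term not in returned)
        result.append({
            "key": term,
            "doc_count": doc_count,
            "doc_count_error_upper_bound": error,
        })
    return result

# Made-up data: two shards, three colors.
shards = [
    {"RED": 10, "BLUE": 8, "GREEN": 2},
    {"GREEN": 9, "BLUE": 7, "RED": 3},
]
result = merge_shard_terms(shards, shard_size=2, size=2)
for bucket in result:
    print(bucket)
```

In this made-up data RED really has 13 documents, but the second shard never reports it, so only 10 are counted, and the bound of 7 covers the 3 that were missed.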
The doc_count_error_upper_bound field returned by a terms aggregation gives me the information I'm looking for. However, it is not returned by a date histogram aggregation (either as a stand-alone aggregation or when used as a sub-aggregation).
For example, given this query:
GET /widgets_*/_search
{
  "aggs": {
    "term_groupings": {
      "terms": {
        "field": "options.color",
        "size": 5,
        "show_term_doc_count_error": true
      },
      "aggs": {
        "date_grouping": {
          "date_histogram": {
            "calendar_interval": "month",
            "min_doc_count": 1,
            "field": "created_at",
            "format": "strict_date_time"
          }
        }
      }
    }
  },
  "size": 0,
  "track_total_hits": false,
  "_source": false
}
...I get a response like this:
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "term_groupings": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "RED",
          "doc_count": 39,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 6
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 33
              }
            ]
          }
        },
        {
          "key": "BLUE",
          "doc_count": 36,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 8
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        },
        {
          "key": "GREEN",
          "doc_count": 35,
          "doc_count_error_upper_bound": 0,
          "date_grouping": {
            "buckets": [
              {
                "key_as_string": "2023-10-01T00:00:00.000Z",
                "key": 1696118400000,
                "doc_count": 7
              },
              {
                "key_as_string": "2023-11-01T00:00:00.000Z",
                "key": 1698796800000,
                "doc_count": 28
              }
            ]
          }
        }
      ]
    }
  }
}
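For reference, this is roughly how I read that response on the client side. It is a minimal Python sketch: the helper name `summarize` is my own, and `sample` is the response above trimmed to the fields the helper needs. It collects the only error bound available (the one on the terms bucket) and checks whether the date histogram counts add up to the parent doc_count. That check assumes every document has exactly one created_at value, and it only shows the counts are consistent with each other, not that the parent count is complete.

```python
def summarize(response, terms_agg="term_groupings", hist_agg="date_grouping"):
    """Collect per-term counts, the term-level error bound, and a consistency check."""
    rows = []
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        sub_counts = [b["doc_count"] for b in bucket[hist_agg]["buckets"]]
        rows.append({
            "key": bucket["key"],
            "doc_count": bucket["doc_count"],
            # Present only on the terms bucket, never on the histogram buckets.
            "term_error_upper_bound": bucket.get("doc_count_error_upper_bound"),
            "histogram_total": sum(sub_counts),
            "histogram_matches_parent": sum(sub_counts) == bucket["doc_count"],
        })
    return rows

# The response above, trimmed to the fields used by summarize().
sample = {
    "aggregations": {
        "term_groupings": {
            "buckets": [
                {"key": "RED", "doc_count": 39, "doc_count_error_upper_bound": 0,
                 "date_grouping": {"buckets": [{"doc_count": 6}, {"doc_count": 33}]}},
                {"key": "BLUE", "doc_count": 36, "doc_count_error_upper_bound": 0,
                 "date_grouping": {"buckets": [{"doc_count": 8}, {"doc_count": 28}]}},
                {"key": "GREEN", "doc_count": 35, "doc_count_error_upper_bound": 0,
                 "date_grouping": {"buckets": [{"doc_count": 7}, {"doc_count": 28}]}},
            ]
        }
    }
}

rows = summarize(sample)
for row in rows:
    print(row)
```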
I get doc_count_error_upper_bound on the terms aggregation buckets, but not on the date histogram buckets. I understand that a date histogram aggregation on its own does not suffer from the same accuracy issue as a terms aggregation. However, as a sub-aggregation of a terms aggregation, could it suffer from that issue? (After all, the terms aggregation documentation says that sub-aggregations can be affected.) If so, how do I determine whether the date histogram bucket doc counts are accurate, and what the upper bound on the error might be?
Thanks!