Questions about aggregation min_doc_count = 0

John_Stanford · March 18, 2014, 7:39pm

Hi,

I'm trying to get a better understanding of aggregations, so here are a
couple of questions that came up recently.

Question 1:

I have some time-based data that I am charting with aggregations. The
data may be sparsely populated, so I've been setting min_doc_count to 0 so
that I get empty buckets back anyway. I've noticed that it fills in empty
buckets unless they fall before the first record or after the last record
in the range.

For example, if I use a query similar to the one below, and there are no
records after 3/15/14T16:15, the last aggregation record will be for
3/15/14T16:15. On the other hand, if there is a gap in between the start
time and 3/15/14T16:15, I will get a bucket with a 0 doc count (as
expected).

POST _all/summary_phys/_search
{
  "aggs": {
    "events_by_date": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "300s",
        "min_doc_count": 0
      },
      "aggs": {
        "events_by_host": {
          "terms": {
            "field": "host.raw"
          },
          "aggs": {
            "avg_used": {
              "avg": {
                "field": "used"
              }
            },
            "max_used": {
              "max": {
                "field": "used"
              }
            }
          }
        }
      }
    }
  }
}

Not getting the 0 doc count buckets back at the front and back of the range
seems contrary to the documented purpose of min_doc_count. Am I doing
something wrong?

Question 2:

If I add a min_doc_count = 0 to the inner aggregation, but limit the search
to a specific doc type like:

          doc type
              v
POST _all/summary_phys/_search
{
  "aggs": {
    "events_by_date": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "300s",
        "min_doc_count": 0
      },
      "aggs": {
        "events_by_host": {
          "terms": {
            "field": "host.raw",
            "min_doc_count": 0
          },
          "aggs": {
            "avg_used": {
              "avg": {
                "field": "used"
              }
            },
            "max_used": {
              "max": {
                "field": "used"
              }
            }
          }
        }
      }
    }
  }
}

I get buckets with entries matching hosts that do not show up in this doc
type. For example, I have only 3 values for host in this doc type
[compute-4, compute-2, compute-3], but I will get buckets back with hosts
from other doc types like:

"events_by_host": {
  "buckets": [
    {
      "key": "compute-4",
      "doc_count": 11,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 3677.090909090909
      }
    },
    {
      "key": "compute-2",
      "doc_count": 8,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 2304
      }
    },
    {
      "key": "compute-3",
      "doc_count": 2,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 4608
      }
    },
    {
      "key": "10.10.11.22:49509",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    },
    {
      "key": "controller",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    },
    {
      "key": "object-1",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    }
  ]
}

Is there a way to ensure that the inner aggregation also only buckets
things matching the search doc type?

Thanks in advance...

John

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/856133dc-c4ae-4cfc-adab-39453671d76d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mattweber · March 18, 2014, 8:17pm

1. The histogram aggregation (and facet) works on indexed values, not on
the current time or "now". So if the last indexed document timestamp is
3/15/14T16:15, you will not get empty buckets between 3/15/14T16:15 and the
current time. It would be interesting to be able to set a "to" and "from"
on histogram-based aggregations so that buckets are generated for every
interval in the defined range.

2. I believe this comes from the way the keys are pulled from the
fielddata, which is index-level data. So if you are using the "_all" index,
you are going to get data from all indices. Not sure if this is a bug or
not. You can try applying a filter aggregation:

POST _all/summary_phys/_search
{
  "aggs": {
    "summary_phys_events": {
      "filter": {
        "type": {"value": "summary_phys_events"}
      },
      "aggs": {
        "events_by_date": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "300s",
            "min_doc_count": 0
          },
          "aggs": {
            "events_by_host": {
              "terms": {
                "field": "host.raw",
                "min_doc_count": 0
              },
              "aggs": {
                "avg_used": {
                  "avg": {
                    "field": "used"
                  }
                },
                "max_used": {
                  "max": {
                    "field": "used"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
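
If the server still returns zero-count buckets for hosts from other doc types, one possible client-side workaround for Question 2 is to keep only the hosts that have at least one matching document somewhere in the response. The following is a minimal Python sketch, not part of the original reply: the function name keep_observed_hosts and the sample data are hypothetical, and it assumes you pass in the list of events_by_date buckets taken from the parsed response.

```python
def keep_observed_hosts(date_buckets, host_agg="events_by_host"):
    """Keep only host buckets whose key has doc_count > 0 in at least one date bucket.

    Zero-count entries for those hosts are preserved, so gaps remain visible;
    hosts that never match (for example, hosts from other doc types) are dropped.
    """
    # Collect every host that has at least one real hit anywhere in the response.
    observed = {
        host["key"]
        for bucket in date_buckets
        for host in bucket[host_agg]["buckets"]
        if host["doc_count"] > 0
    }
    cleaned = []
    for bucket in date_buckets:
        kept = [h for h in bucket[host_agg]["buckets"] if h["key"] in observed]
        # Copy the date bucket so the original response is left untouched.
        cleaned.append({**bucket, host_agg: {"buckets": kept}})
    return cleaned


# Hypothetical sample shaped like the response fragment in the question.
sample = [
    {"key": 1394900100000, "doc_count": 2, "events_by_host": {"buckets": [
        {"key": "compute-4", "doc_count": 2},
        {"key": "compute-2", "doc_count": 0},
        {"key": "controller", "doc_count": 0},
    ]}},
    {"key": 1394900400000, "doc_count": 1, "events_by_host": {"buckets": [
        {"key": "compute-4", "doc_count": 0},
        {"key": "compute-2", "doc_count": 1},
        {"key": "controller", "doc_count": 0},
    ]}},
]
cleaned = keep_observed_hosts(sample)
```

With the filter aggregation above, the date buckets would sit one level deeper in the response, under the summary_phys_events key.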

John_Stanford · March 19, 2014, 3:40am

Thanks Matt, I suspected as much on #1. I think it might save a little post-processing if it provided buckets for the specified range. The issue appears to be logged as "Create empty buckets in date_/histogram aggregation at the edges, beyond the value space of the data" (Issue #5224, elastic/elasticsearch on GitHub), and a pull request has been made.

I tried the filter on #2, and it still picked up hosts that weren’t in that doc type, so I filed "filtering aggregations" (Issue #5458, elastic/elasticsearch on GitHub).
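
The post-processing mentioned for #1 could look roughly like the Python sketch below. It is only an illustration: the function name pad_buckets and the sample values are hypothetical, and it assumes that bucket keys are epoch milliseconds aligned to multiples of the interval, which may not hold if a time zone or offset is configured.

```python
def pad_buckets(buckets, start_ms, end_ms, interval_ms):
    """Return buckets covering [start_ms, end_ms], adding empty ones where missing.

    Buckets returned by the server are reused as-is; buckets whose keys fall
    outside the requested range are not included in the result.
    """
    by_key = {b["key"]: b for b in buckets}
    # Round the start down to an interval boundary so keys line up with the server's.
    first = start_ms - (start_ms % interval_ms)
    padded = []
    for key in range(first, end_ms + 1, interval_ms):
        padded.append(by_key.get(key, {"key": key, "doc_count": 0}))
    return padded


# Hypothetical example: one real bucket, 300s interval, range wider than the data.
interval_ms = 300 * 1000
returned = [{"key": 1394900100000, "doc_count": 3}]
padded = pad_buckets(returned, 1394899800000, 1394900700000, interval_ms)
```

Here the single returned bucket ends up surrounded by empty buckets at the front and back of the requested range.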

Cheers,

John

jxstanford@gmail.com
@jxstanford
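
For readers on later releases: the Elasticsearch documentation describes an extended_bounds option on histogram aggregations that, together with min_doc_count set to 0, is meant to produce empty buckets out to a given min and max. The Python sketch below only builds such a request body. Whether your Elasticsearch version supports the option, and which date formats it accepts, are assumptions to check against the documentation for that version; the function name and the bound values are hypothetical.

```python
import json


def histogram_request(min_bound, max_bound):
    """Build a date_histogram request body that asks for buckets across a fixed range."""
    return {
        "aggs": {
            "events_by_date": {
                "date_histogram": {
                    "field": "@timestamp",
                    "interval": "300s",
                    "min_doc_count": 0,
                    # Assumed to be supported by the target Elasticsearch version.
                    "extended_bounds": {"min": min_bound, "max": max_bound},
                }
            }
        }
    }


# Hypothetical bounds covering one day of the data discussed in this thread.
body = histogram_request("2014-03-15T00:00:00", "2014-03-15T23:59:59")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same _search request used earlier in the thread.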
