Questions about aggregation min_doc_count = 0


(John Stanford) #1

Hi,

I'm trying to get a better understanding of aggregations, so here are a
couple of questions that came up recently.

Question 1:

I have some time-based data that I am using aggregations to chart. The
data may be sparsely populated, so I've been setting min_doc_count to 0 so
I get empty buckets back anyway. I've noticed that it fills in empty
buckets between records, but not before the first record or after the
last record in the range.

For example, if I use a query similar to the one below, and there are no
records after 3/15/14T16:15, the last aggregation record will be for
3/15/14T16:15. On the other hand, if there is a gap in between the start
time and 3/15/14T16:15, I will get a bucket with a 0 doc count (as
expected).

POST _all/summary_phys/_search
{
  "aggs": {
    "events_by_date": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "300s",
        "min_doc_count": 0
      },
      "aggs": {
        "events_by_host": {
          "terms": {
            "field": "host.raw"
          },
          "aggs": {
            "avg_used": {
              "avg": {
                "field": "used"
              }
            },
            "max_used": {
              "max": {
                "field": "used"
              }
            }
          }
        }
      }
    }
  }
}

Not getting the 0 doc count buckets back at the front and back of the range
seems contrary to the documented purpose of min_doc_count. Am I doing
something wrong?

Question 2:

If I add min_doc_count = 0 to the inner aggregation, but limit the search
to a specific doc type like:

          doc type
              v

POST _all/summary_phys/_search
{
  "aggs": {
    "events_by_date": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "300s",
        "min_doc_count": 0
      },
      "aggs": {
        "events_by_host": {
          "terms": {
            "field": "host.raw",
            "min_doc_count": 0
          },
          "aggs": {
            "avg_used": {
              "avg": {
                "field": "used"
              }
            },
            "max_used": {
              "max": {
                "field": "used"
              }
            }
          }
        }
      }
    }
  }
}

I get buckets with entries matching hosts that do not show up in this doc
type. For example, I have only 3 values for host in this doc type
[compute-4, compute-2, compute-3], but I will get buckets back with hosts
from other doc types like:

"events_by_host": {
  "buckets": [
    {
      "key": "compute-4",
      "doc_count": 11,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 3677.090909090909
      }
    },
    {
      "key": "compute-2",
      "doc_count": 8,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 2304
      }
    },
    {
      "key": "compute-3",
      "doc_count": 2,
      "max_used": {
        "value": 4608
      },
      "avg_used": {
        "value": 4608
      }
    },
    {
      "key": "10.10.11.22:49509",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    },
    {
      "key": "controller",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    },
    {
      "key": "object-1",
      "doc_count": 0,
      "max_used": {
        "value": null
      },
      "avg_used": {
        "value": null
      }
    }
  ]
}

Is there a way to ensure that the inner aggregation also only buckets
things matching the search doc type?

Thanks in advance...

John

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/856133dc-c4ae-4cfc-adab-39453671d76d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Matt Weber) #2
  1. The histogram aggregation (and facet) works on indexed values, not on
    the current time or "now". So if the last indexed document timestamp
    is 3/15/14T16:15, you will not get empty buckets between 3/15/14T16:15
    and the current time. It would be interesting to be able to set a "from"
    and "to" on histogram-based aggregations, so that buckets are generated
    for every interval in the defined range.

  2. I believe this comes from the way the keys are pulled from the
    fielddata, which is index-level data. So if you are using the "_all"
    index, you are going to get data from all indices. Not sure if this is a
    bug or not. You can try applying a filter aggregation:

POST _all/summary_phys/_search
{
  "aggs": {
    "summary_phys_events": {
      "filter": {
        "type": {"value": "summary_phys_events"}
      },
      "aggs": {
        "events_by_date": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "300s",
            "min_doc_count": 0
          },
          "aggs": {
            "events_by_host": {
              "terms": {
                "field": "host.raw",
                "min_doc_count": 0
              },
              "aggs": {
                "avg_used": {
                  "avg": {
                    "field": "used"
                  }
                },
                "max_used": {
                  "max": {
                    "field": "used"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
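
Until the histogram can be given explicit bounds, the missing leading and trailing buckets from question 1 can be padded on the client. The following is a minimal Python sketch, not part of Elasticsearch: the function name pad_buckets and its parameters are hypothetical, and it assumes that each date_histogram bucket carries its start time as epoch milliseconds under "key" and that a fixed interval such as 300s is aligned to multiples of the interval.

```python
def pad_buckets(buckets, start_ms, end_ms, interval_ms):
    """Return one bucket per interval between start_ms and end_ms.

    Buckets present in the response are kept as-is; intervals the
    response does not cover get a placeholder with doc_count 0.
    """
    # Index the returned buckets by their key (assumed: epoch millis).
    by_key = {bucket["key"]: bucket for bucket in buckets}

    # Align the start of the range down to an interval boundary
    # (assumption: fixed intervals are aligned to multiples of the interval).
    key = start_ms - (start_ms % interval_ms)

    padded = []
    while key <= end_ms:
        # Placeholder buckets carry no sub-aggregations; the caller
        # has to treat those as missing.
        padded.append(by_key.get(key, {"key": key, "doc_count": 0}))
        key += interval_ms
    return padded
```

With "interval": "300s", interval_ms would be 300000, and start_ms and end_ms would be the bounds of the chart being drawn.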

On Tue, Mar 18, 2014 at 12:39 PM, John Stanford jxstanford@gmail.com wrote:



(John Stanford) #3

Thanks Matt, I suspected as much on #1. I think it might save a little post-processing if it provided buckets for the specified range. The issue appears to be logged as https://github.com/elasticsearch/elasticsearch/issues/5224 and a pull request has been made. I tried the filter on #2, and it still picked up hosts that weren’t in that doc type, so I filed https://github.com/elasticsearch/elasticsearch/issues/5458.

Cheers,

John

jxstanford@gmail.com
@jxstanford
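
Since the filter aggregation did not keep hosts from other doc types out of the results, one client-side workaround for question 2 is to drop host buckets that never have a non-zero count anywhere in the response. This is a rough Python sketch under that assumption: prune_foreign_hosts is a hypothetical helper, the aggregation name defaults to the "events_by_host" used above, and it will also drop a host of the right doc type if that host has no documents at all in the queried range.

```python
def prune_foreign_hosts(date_buckets, agg_name="events_by_host"):
    """Drop host buckets whose key never has doc_count > 0 in any date bucket."""
    # Hosts that actually matched at least one document somewhere in the range.
    seen = {
        host["key"]
        for date_bucket in date_buckets
        for host in date_bucket.get(agg_name, {}).get("buckets", [])
        if host["doc_count"] > 0
    }

    pruned = []
    for date_bucket in date_buckets:
        hosts = date_bucket.get(agg_name, {}).get("buckets", [])
        # Shallow-copy so the original response is left untouched.
        cleaned = dict(date_bucket)
        cleaned[agg_name] = {"buckets": [h for h in hosts if h["key"] in seen]}
        pruned.append(cleaned)
    return pruned
```

Zero-count buckets for hosts that do appear elsewhere in the range are kept, so the chart still gets an explicit 0 for an interval where a known host reported nothing.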

On Mar 18, 2014, at 13:17:10, Matt Weber matt.weber@gmail.com wrote:



(system) #4