When I run a terms aggregation with min_doc_count: 0, terms that have no matching documents are not returned with a count of zero. Is there anything I should check?
We only gather the values for matching docs. If you want to get aggregations on values in docs other than just those that match your query, see the ‘global’ aggregation for a broader scope.
My understanding from the documentation is that when min_doc_count is set to 0, all the terms in the search range will be returned, even if a term does not have any documents in that range. Is my understanding right? I am not sure if I am missing something.
My bad. You’re right. It should return non-matching terms, but subject to the ‘size’ restriction on how many terms to bring back (default 10).
My size is 15. The behavior is as follows:
- For a given time in the time range I am getting 3 terms with 0
- For another time in the time range I am getting 5 terms with 0
- For another time in the time range I am getting 2 terms with 0
With min_doc_count set to 0, we should be getting the same number of terms for every time in the given range. Is there anything I should check to find out why the returned terms are not the same for every time in the searched time range?
How many non-zero-count terms do you get in each of those ranges? It might be that they are more competitive choices for making up your top 15
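To make the "competitive choices" point concrete, here is a minimal sketch of how a size cap interacts with zero-count terms. It is a simplified single-shard model, not Elasticsearch's actual implementation: the function name `top_terms`, the 20-term field, and the sample counts are all hypothetical, and the ordering (count descending, then key ascending) is assumed to match the default.

```python
def top_terms(bucket_counts, all_terms, size=10, min_doc_count=1):
    """Simplified single-shard model of terms-agg bucket selection (an assumption)."""
    # Every known term gets a count, defaulting to 0 when absent from this bucket.
    counts = {term: bucket_counts.get(term, 0) for term in all_terms}
    # Drop terms below the min_doc_count threshold.
    candidates = [(t, c) for t, c in counts.items() if c >= min_doc_count]
    # Assumed default order: highest count first, ties broken by key.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    # Only the first `size` terms are returned.
    return candidates[:size]


all_terms = [f"term{i:02d}" for i in range(20)]     # hypothetical 20-term field
bucket_a = {"term03": 4, "term11": 1}               # 2 non-zero terms
bucket_b = {f"term{i:02d}": 1 for i in range(16)}   # 16 non-zero terms

top_a = top_terms(bucket_a, all_terms, size=15, min_doc_count=0)
top_b = top_terms(bucket_b, all_terms, size=15, min_doc_count=0)

zeros_a = sum(1 for _, c in top_a if c == 0)
zeros_b = sum(1 for _, c in top_b if c == 0)
print(zeros_a, zeros_b)  # 13 0
```

In this model the number of zero-count terms that fit depends on how many non-zero terms take the top slots first. It does not model multiple shards, where shard_size adds a further cut-off.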
I am supposed to get 5 terms every time; the number of non-zero terms varies depending on the time in the given range. What is the impact of size on this behavior? I am not clear on that part.
In all of them, at most 1 or 2 terms are non-zero.
I think I need to see some JSON to make requests/responses clearer
For the request below:
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "month"
      },
      "aggs": {
        "tags": {
          "terms": {
            "field": "tags"
          }
        }
      }
    }
  }
}
The response looks like this:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 11,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "sales_per_month": {
      "buckets": [
        {
          "key_as_string": "2016-01-01T00:00:00.000Z",
          "key": 1451606400000,
          "doc_count": 5,
          "tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "blue",
                "doc_count": 3
              },
              {
                "key": "green",
                "doc_count": 2
              },
              {
                "key": "red",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key_as_string": "2016-02-01T00:00:00.000Z",
          "key": 1454284800000,
          "doc_count": 5,
          "tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "green",
                "doc_count": 4
              },
              {
                "key": "red",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key_as_string": "2016-03-01T00:00:00.000Z",
          "key": 1456790400000,
          "doc_count": 1,
          "tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "blue",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}
In the above response, for bucket key 1456790400000 I am expecting "green" and "red" to also be in the buckets with a count of zero, but I am getting only "blue" with a count of 1.
Also, for bucket key 1454284800000 I need "blue" in the buckets with 0.
To be precise, I want all the colors returned in the buckets with 0 even if they are not present.
Please let me know if there is any way to achieve this.
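The gap described above can be listed mechanically. The following is a small sketch that walks a response shaped like the one in this thread and reports which expected terms are absent from each date bucket. The helper name `missing_terms` is hypothetical, and the response dict is trimmed to only the fields the helper reads.

```python
def missing_terms(response, expected, hist="sales_per_month", terms="tags"):
    """Map each date bucket to the expected terms it does not contain."""
    missing = {}
    for bucket in response["aggregations"][hist]["buckets"]:
        present = {b["key"] for b in bucket[terms]["buckets"]}
        absent = sorted(set(expected) - present)
        if absent:
            missing[bucket["key_as_string"]] = absent
    return missing


# Trimmed copy of the response posted above (only the fields used here).
response = {"aggregations": {"sales_per_month": {"buckets": [
    {"key_as_string": "2016-01-01T00:00:00.000Z",
     "tags": {"buckets": [{"key": "blue", "doc_count": 3},
                          {"key": "green", "doc_count": 2},
                          {"key": "red", "doc_count": 1}]}},
    {"key_as_string": "2016-02-01T00:00:00.000Z",
     "tags": {"buckets": [{"key": "green", "doc_count": 4},
                          {"key": "red", "doc_count": 1}]}},
    {"key_as_string": "2016-03-01T00:00:00.000Z",
     "tags": {"buckets": [{"key": "blue", "doc_count": 1}]}},
]}}}

result = missing_terms(response, ["blue", "green", "red"])
for month, absent in result.items():
    print(month, absent)
```

For the posted response this reports "blue" as missing in February and "green" and "red" as missing in March, which matches the expectation stated above.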
Add "min_doc_count": 0 to the terms agg?
In spite of adding min_doc_count, I am not getting the intended results. Let me know if something is missing.
Working here on 6.6.0:
Setup:
DELETE test

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "timestamp": {
          "type": "date"
        },
        "tags": {
          "type": "keyword"
        }
      }
    }
  }
}

POST test/_doc/1
{
  "timestamp": "2016-02-01T00:00:00.000Z",
  "tags": ["green", "red"]
}

POST test/_doc/2
{
  "timestamp": "2016-03-01T00:00:00.000Z",
  "tags": ["blue"]
}

POST test/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "month"
      },
      "aggs": {
        "tags": {
          "terms": {
            "field": "tags",
            "min_doc_count": 0
          }
        }
      }
    }
  }
}
Response:
{
  "took" : 94,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "sales_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2016-02-01T00:00:00.000Z",
          "key" : 1454284800000,
          "doc_count" : 1,
          "tags" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "green",
                "doc_count" : 1
              },
              {
                "key" : "red",
                "doc_count" : 1
              },
              {
                "key" : "blue",
                "doc_count" : 0
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-03-01T00:00:00.000Z",
          "key" : 1456790400000,
          "doc_count" : 1,
          "tags" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "blue",
                "doc_count" : 1
              },
              {
                "key" : "green",
                "doc_count" : 0
              },
              {
                "key" : "red",
                "doc_count" : 0
              }
            ]
          }
        }
      ]
    }
  }
}
I tried this in our application, where there are millions of docs. In spite of setting min_doc_count to 0, the terms with no docs are not returned with zero. It would be great if you could let me know under what conditions this can happen. Thanks in advance.
Maybe you have a lot of competing terms, and size or shard_size settings that aren't high enough to consider all the values that should be returned?
The field we are running the terms aggregation on has high cardinality, around 4,200. If that has an impact, what would be the approach to handle this situation?
Increase the shard_size as per the guidance in the docs.
However, consider how much data you're asking for if you're nesting the terms agg under a date_histogram as per your example, as the number of results becomes 4,200 x numberOfDateBuckets.
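As a rough illustration of that multiplication: the 4,200 figure comes from this thread, while the 12 monthly buckets are purely an assumption (one year of data at the "month" interval).

```python
cardinality = 4200   # distinct terms in the field, as stated in the thread
date_buckets = 12    # assumption: one year of monthly date_histogram buckets

# Every date bucket can carry one terms bucket per distinct term
# once min_doc_count is 0 and size is high enough to return them all.
total_term_buckets = cardinality * date_buckets
print(total_term_buckets)  # 50400
```

A longer time range or a finer interval scales the total linearly.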
As I understand it, shard_size is by default the same as size. I tried this and still see the same issue. Is there anything I need to check to find out why this happens only when the cardinality is high?
I'm unclear - what numbers did you set for shard_size and size?
shard_size and size were both set to 500.
I'm confused then. You said you wanted all terms (even those with doc_count=0) and that there were 4,200 of them. Why pick sizes less than 4,200 if that's the case?
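A minimal sketch of what that implies for the request body, built as a plain Python dict so it can be sent with any HTTP client. The helper name `build_search_body` and the value 5000 are assumptions chosen only to sit above the 4,200 terms mentioned in this thread; the field names and the "month" interval are copied from the earlier example.

```python
import json


def build_search_body(size):
    """Build the search body from this thread with explicit terms sizes."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # no hits needed, only aggregations
        "aggs": {
            "sales_per_month": {
                "date_histogram": {"field": "timestamp", "interval": "month"},
                "aggs": {
                    "tags": {
                        "terms": {
                            "field": "tags",
                            "min_doc_count": 0,
                            # Both limits must cover every distinct term,
                            # otherwise zero-count terms can be cut off.
                            "size": size,
                            "shard_size": size,
                        }
                    }
                },
            }
        },
    }


CARDINALITY = 4200                    # from the thread
body = build_search_body(size=5000)   # 5000 is an assumed value above 4,200
terms = body["aggs"]["sales_per_month"]["aggs"]["tags"]["terms"]
print(json.dumps(body, indent=2))
```

Keep the earlier warning in mind: with sizes this high, each date bucket can return thousands of term buckets.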