Terms Agg


(aniaks) #1

When running a terms aggregation with min_doc_count: 0, terms with no matching documents are not returned with a count of zero. Is there anything that needs to be checked?


(Mark Harwood) #2

We only gather the values for matching docs. If you want to aggregate on values in docs other than just those that match your query, see the 'global' aggregation for a broader scope.
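
Something like this (just a sketch - the index and field names are placeholders):

    POST my_index/_search
    {
      "query": {
        "match": { "tags": "blue" }
      },
      "size": 0,
      "aggs": {
        "all_docs": {
          "global": {},
          "aggs": {
            "all_tags": {
              "terms": { "field": "tags" }
            }
          }
        }
      }
    }

The "all_docs" bucket is then computed over every document in the index, so its "all_tags" counts are not limited to the docs matching the query.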


(aniaks) #3

My understanding from the documentation is that when min_doc_count is set to 0, all the terms in the search range will be returned even if a term has no documents in that range. Is my understanding right? I am not sure if I am missing something.


(Mark Harwood) #4

My bad. You're right. It should return non-matching terms, but subject to the 'size' restriction on how many terms to bring back (default 10).
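
For example, something like this (a sketch - the index/field names and size value are placeholders) to bring back more terms, including the zero-count ones:

    POST my_index/_search
    {
      "size": 0,
      "aggs": {
        "all_tags": {
          "terms": {
            "field": "tags",
            "min_doc_count": 0,
            "size": 50
          }
        }
      }
    }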


(aniaks) #5

My size is 15. The behavior is like below:

  1. For one time in the time range I am getting 3 terms with 0.
  2. For another time in the time range I am getting 5 terms with 0.
  3. For yet another time in the time range I am getting 2 terms with 0.

With min_doc_count we should be getting the same number of terms for every time in a given range. Is there anything I should check as to why the terms returned are not the same for all times in the searched time range?


(Mark Harwood) #6

How many non-zero-count terms do you get in each of those ranges? It might be that they are more competitive choices for making up your top 15.


(aniaks) #7

I am supposed to get 5 terms every time; the number of non-zero terms varies depending on the time in the given range. May I know what impact size has on this behavior? I am not clear on that part.

In all of them, at most 1 or 2 are non-zero terms.


(Mark Harwood) #8

I think I need to see some JSON to make requests/responses clearer


(aniaks) #9

For the below request:

    {
        "query": {
            "match_all": {}
        },
        "size": 0,
        "aggs": {
            "sales_per_month": {
                "date_histogram": {
                    "field": "timestamp",
                    "interval": "month"
                },
                "aggs": {
                    "tags": {
                        "terms": {
                            "field": "tags"
                        }
                    }
                }
            }
        }
    }

The response is like below:

   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 11,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2016-01-01T00:00:00.000Z",
               "key": 1451606400000,
               "doc_count": 5,
               "tags": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "blue",
                        "doc_count": 3
                     },
                     {
                        "key": "green",
                        "doc_count": 2
                     },
                     {
                        "key": "red",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key_as_string": "2016-02-01T00:00:00.000Z",
               "key": 1454284800000,
               "doc_count": 5,
               "tags": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "green",
                        "doc_count": 4
                     },
                     {
                        "key": "red",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key_as_string": "2016-03-01T00:00:00.000Z",
               "key": 1456790400000,
               "doc_count": 1,
               "tags": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "blue",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }
}

In the above response, for the bucket with key "1456790400000" I am expecting "green" and "red" as well, with a doc_count of zero, but I am getting only "blue" with a value of "1".

Also, for the bucket with key "1454284800000" I need "blue" in the buckets with "0".
Precisely, I want all the colors returned in the buckets with "0" even if they are not present.

Please let me know if there is any way we could achieve this?


(Mark Harwood) #10

Add "min_doc_count":0 to the terms agg?


(aniaks) #11

In spite of adding min_doc_count, I am not getting the intended results. Let me know if something is missing.


(Mark Harwood) #12

Working here on 6.6.0:

Setup:

DELETE test
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc":{
      "properties":{
        "timestamp":{
          "type":"date"
        },
        "tags":{
          "type":"keyword"
        }
      }
    }
  }
}
POST test/_doc/1
{
  "timestamp":"2016-02-01T00:00:00.000Z",
  "tags" : ["green", "red"]
}
POST test/_doc/2
{
  "timestamp":"2016-03-01T00:00:00.000Z",
  "tags" : ["blue"]
}
POST test/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "month"
      },
      "aggs": {
        "tags": {
          "terms": {
            "field": "tags",
            "min_doc_count": 0
          }
        }
      }
    }
  }
}

Response:

{
  "took" : 94,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "sales_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2016-02-01T00:00:00.000Z",
          "key" : 1454284800000,
          "doc_count" : 1,
          "tags" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "green",
                "doc_count" : 1
              },
              {
                "key" : "red",
                "doc_count" : 1
              },
              {
                "key" : "blue",
                "doc_count" : 0
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-03-01T00:00:00.000Z",
          "key" : 1456790400000,
          "doc_count" : 1,
          "tags" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "blue",
                "doc_count" : 1
              },
              {
                "key" : "green",
                "doc_count" : 0
              },
              {
                "key" : "red",
                "doc_count" : 0
              }
            ]
          }
        }
      ]
    }
  }
}

(aniaks) #13

I tried this in the application, where there are millions of docs. In spite of setting min_doc_count to 0, the terms with no docs are not returned with zero. It would be great if you could let me know under what conditions this could happen. Thanks in advance.


(Mark Harwood) #14

Maybe you have a lot of competing terms and size or shard_size settings that aren't high enough to consider all the values that should be returned?


(aniaks) #15

The field we are doing the terms aggregation on has a high cardinality, around 4,200. If that has any impact, what should be the approach to handle this situation?


(Mark Harwood) #16

Increase the shard_size as per the guidance in the docs.
However, consider how much data you're asking for if you're nesting the terms agg under a date_histogram as in your example, because the number of results becomes 4,200 x numberOfDateBuckets (e.g. a year of monthly buckets would be 12 x 4,200 = 50,400 term buckets in one response).


(aniaks) #17

shard_size by default is the same as size. I tried this and still have the same issue. Let me know if there is anything I need to check as to why this issue happens only when there is high cardinality?


(Mark Harwood) #18

I'm unclear - what numbers did you set for shard_size and size?


(aniaks) #19

shard_size and size both were set to 500


(Mark Harwood) #20

I'm confused then. You said you wanted all terms (even those with doc_count=0) and that there were 4,200 of them. Why pick sizes less than 4,200 if that's the case?
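
If the goal really is to see all ~4,200 terms (zero counts included) under every date bucket, the request would need something along these lines (a sketch - my_index is a placeholder, and 5000 is just an example value comfortably above the number of distinct terms):

    POST my_index/_search
    {
      "size": 0,
      "aggs": {
        "sales_per_month": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "month"
          },
          "aggs": {
            "tags": {
              "terms": {
                "field": "tags",
                "min_doc_count": 0,
                "size": 5000,
                "shard_size": 5000
              }
            }
          }
        }
      }
    }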