Filter based on the doc_count with aggregations


(Yeikel) #1

I have the following index :

POST /cars/transactions/_bulk
    { "index": {}}
    { "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
    { "index": {}}
    { "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
    { "index": {}}
    { "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
    { "index": {}}
    { "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

And I am performing the following search :

    GET /cars/transactions/_search
    {
        "size" : 0,
        "aggs" : { 
            "popular_colors" : { 
                "terms" : { 
                  "field" : "color"
                }
            }
        }
    }

The response that I receive is the following :

    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 8,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "popular_colors": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "red",
              "doc_count": 4
            },
            {
              "key": "blue",
              "doc_count": 2
            },
            {
              "key": "green",
              "doc_count": 2
            }
          ]
        }
      }
    }

My question is , how can I filter by doc_count?

For example "return only the documents where doc_count is equal to 1"

I am using Elasticsearch 2.3.5

Thank you


How can I filter by count? / How can I apply a custom filter?
Filter based on the doc_count with aggregations (2)
(Christoph) #2

Hi,

you can use the bucket_selector pipeline aggregation for this kind of filtering. In your case, the following query:

GET /cars/transactions/_search
{
   "size": 0,
   "aggs": {
      "popular_colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "my_filter": {
               "bucket_selector": {
                  "buckets_path": {
                     "the_doc_count": "_count"
                  },
                  "script": "the_doc_count == 2"
               }
            }
         }
      }
   }
}

Should only filter out the buckets with "doc_count" : 2. However, be aware that Pipeline aggregations work on the outputs produced from other aggregations, so the overall amount of work that needs to be done to calculate the initial doc_counts will be the same. Since the script parts needs to be executed for each input bucket, the opetation might potentially be slow for high cardinality fields (as in thousands of thousands of terms), but it should work well for relatively low cardinality fields (like colors, as in this case).


(system) #3