Filter based on the doc_count with aggregations

yeikel · October 11, 2016, 2:09am

I have the following index :

POST /cars/transactions/_bulk
    { "index": {}}
    { "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
    { "index": {}}
    { "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
    { "index": {}}
    { "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
    { "index": {}}
    { "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

And I am performing the following search :

    GET /cars/transactions/_search
    {
        "size" : 0,
        "aggs" : { 
            "popular_colors" : { 
                "terms" : { 
                  "field" : "color"
                }
            }
        }
    }

The response that I receive is the following :

    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 8,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "popular_colors": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "red",
              "doc_count": 4
            },
            {
              "key": "blue",
              "doc_count": 2
            },
            {
              "key": "green",
              "doc_count": 2
            }
          ]
        }
      }
    }

My question is , how can I filter by doc_count?

For example "return only the documents where doc_count is equal to 1"

I am using Elasticsearch 2.3.5

Thank you

cbuescher · October 11, 2016, 12:43pm

Hi,

you can use the bucket_selector pipeline aggregation for this kind of filtering. In your case, the following query:

GET /cars/transactions/_search
{
   "size": 0,
   "aggs": {
      "popular_colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "my_filter": {
               "bucket_selector": {
                  "buckets_path": {
                     "the_doc_count": "_count"
                  },
                  "script": "the_doc_count == 2"
               }
            }
         }
      }
   }
}

Should only filter out the buckets with "doc_count" : 2. However, be aware that Pipeline aggregations work on the outputs produced from other aggregations, so the overall amount of work that needs to be done to calculate the initial doc_counts will be the same. Since the script parts needs to be executed for each input bucket, the opetation might potentially be slow for high cardinality fields (as in thousands of thousands of terms), but it should work well for relatively low cardinality fields (like colors, as in this case).

Topic		Replies	Views
Filter based on the doc_count with aggregations (2) Elasticsearch	2	4915	October 5, 2018
How to aggregate by doc_count? Elasticsearch	9	2296	January 14, 2020
ES filter composite aggregation result by doc_count Kibana	4	865	September 6, 2022
ElasticSearch - Filter Buckets Elasticsearch	2	320	January 26, 2021
Show aggregations of same category after filter Elasticsearch	1	108	November 30, 2022

Filter based on the doc_count with aggregations

Related topics