Aggregation: Removing empty buckets from the response


(John Harding) #1

I'm using nested aggregations to find duplicates in a nested array. I expect only a few duplicates (say, 3 duplicates out of 500 possible values across 10,000 documents). The response to my query includes 500 buckets (one for each possible value), but 497 of them contain empty sub-buckets, meaning that those property names aren't duplicated. Is there a way to return only the 3 buckets whose sub-buckets contain data?

I'm not sure I explained that very well...

Here's my query (courtesy of @dantuff):

{
   "size": 0,
   "query": {
      "match_all": {}
   },
   "aggs": {
      "nested_properties": {
         "nested": {
            "path": "properties"
         },
         "aggs": {
            "display_name": {
               "terms": {
                  "field": "properties.display.raw",
                  "size": 1000000
               },
               "aggs": {
                  "doc_id": {
                     "terms": {
                        "field": "_uid",
                        "min_doc_count": 2,
                        "size": 1000000
                     },
                     "aggs": {
                        "properties_to_elements": {
                            "reverse_nested": {},
                            "aggs": {
                                "element_id": {
                                    "top_hits": {
                                        "_source": {
                                            "include": [ "system.id", "system.sourceid" ]
}}}}}}}}}}}}}

And here are the first two top-level buckets. Notice that the first bucket (key: "Length") has a doc_id.buckets[] with data in it, whereas the second bucket (key: "Family Name") has an empty doc_id.buckets[].

   "aggregations": {
      "nested_properties": {
         "doc_count": 16487,
         "display_name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "Length",
                  "doc_count": 391,
                  "doc_id": {
                     "doc_count_error_upper_bound": 0,
                     "sum_other_doc_count": 0,
                     "buckets": [
                        {
                           "key": "building-elements#AU3oVn402bOX65T_n7uZ",
                           "doc_count": 2,
                           "properties_to_elements": {
                              "doc_count": 1,
                              "element_id": {
                                 "hits": {
                                    "total": 1,
                                    "max_score": 1,
                                    "hits": [
                                       {
                                          "_index": "5",
                                          "_type": "building-elements",
                                          "_id": "AU3oVn402bOX65T_n7uZ",
                                          "_score": 1,
                                          "_source": {
                                             "system": {
                                                "id": 263,
                                                "sourceid": "f02ec989-d9a7-4c4b-bcc3-749786b1bfc1-00044f1a"
                                             }}}]}}}}] } },
               {
                  "key": "Family Name",
                  "doc_count": 382,
                  "doc_id": {
                     "doc_count_error_upper_bound": 0,
                     "sum_other_doc_count": 0,
                     "buckets": []
                  }
               },
               ....
               LOTS OF OTHER EMPTY BUCKETS DELETED
               ....
            ]}}}}

Is there a way to have ES return only the buckets that don't have an empty doc_id.buckets array?
If there isn't, is there a way to sort the results so that the buckets with a non-empty doc_id.buckets array come first?
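
For reference, if no server-side option is available, the empty buckets can be dropped on the client after the response arrives. Below is a minimal Python sketch, assuming the response has already been parsed into a dict shaped like the excerpt above. The function name `drop_empty_buckets` and the trimmed `sample` data are made up for illustration and are not part of any Elasticsearch client.

```python
from typing import Any, Dict, List


def drop_empty_buckets(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the display_name buckets whose doc_id sub-aggregation is non-empty."""
    display_name = response["aggregations"]["nested_properties"]["display_name"]
    return [
        bucket
        for bucket in display_name["buckets"]
        if bucket["doc_id"]["buckets"]  # an empty list is falsy, so it is skipped
    ]


# Hypothetical, trimmed-down response shaped like the excerpt above.
sample = {
    "aggregations": {
        "nested_properties": {
            "doc_count": 16487,
            "display_name": {
                "buckets": [
                    {
                        "key": "Length",
                        "doc_count": 391,
                        "doc_id": {
                            "buckets": [
                                {
                                    "key": "building-elements#AU3oVn402bOX65T_n7uZ",
                                    "doc_count": 2,
                                }
                            ]
                        },
                    },
                    {
                        "key": "Family Name",
                        "doc_count": 382,
                        "doc_id": {"buckets": []},
                    },
                ]
            },
        }
    }
}

duplicated = drop_empty_buckets(sample)
print([bucket["key"] for bucket in duplicated])
```

This only tidies the result after the fact. Elasticsearch still computes and sends all 500 buckets, so it does not reduce the response size on the wire.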

Many thanks,
John H.


(Mark Walkom) #2

This might help! https://www.elastic.co/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html

