Query with aggregation and return single field in each bucket

Hi all:

Sorry for the weird title. I'm trying to come up with a query that returns both the aggregation buckets and one specific field from every doc within each bucket, but I couldn't find an optimal way to do it.

Currently we store host data with tags associated with each host. We want our app to be able to aggregate by one or more tags and also return the host ids within each group of hosts. Right now the solution is fairly complicated: we store tags as nested objects in each host doc, something like {"tag_key": "zone", "tag_val": "east-1a"}. When people group hosts by zone, the query is:

{
  "aggs": {
    "tags": {
      "aggs": {         ------------------------- 1
        "tag_key": {
          "filter": {
            "term": {
              "tags_nested.tag_key": "zone"
            }
          },
          "aggs": {          ------------------- 2
            "tag_val": {
              "terms": {
                "field": "tags_nested.tag_val",
                "size": 300000
              },
              "aggs": {  ------------------------ 3
                "hosts": {
                  "aggs": {  ---------------------- 4
                    "node_id": {
                      "terms": {
                        "field": "id",
                        "size": 300000
                      }
                    }
                  },
                  "reverse_nested": {}
                }
              }
            }
          }
        }
      },
      "nested": {
        "path": "tags_nested"
      }
    }
  },
  "size": 0
}

step 1: inside the nested "tags_nested" context, filter down to the tag entries (and so the hosts) that have the particular tag key we are grouping by
step 2: “terms” aggregation (aka bucketing) to bucket by all possible values of that tag key
step 3: “reverse nested” aggregation to pop out of the nested document back into the main host document
step 4: another “terms” aggregation to bucket by host id (each host id bucket should contain exactly one host doc)
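
For anyone who wants to try this from client code, here is a minimal Python sketch of the four steps above. It only builds the request body as a plain dict and flattens the aggregation part of a response into {tag value: [host ids]}; it does not talk to a cluster. The function names, the `max_buckets` parameter, and `SAMPLE_RESPONSE` are made up for illustration, and the sample response is hand-written to show the expected shape, not real output from our data.

```python
import json


def build_group_by_tag_query(tag_key, max_buckets=300000):
    """Build the request body for grouping hosts by one tag key (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregation results are needed, no hits
        "aggs": {
            "tags": {
                "nested": {"path": "tags_nested"},
                "aggs": {
                    # step 1: keep only nested tag entries with the requested key
                    "tag_key": {
                        "filter": {"term": {"tags_nested.tag_key": tag_key}},
                        "aggs": {
                            # step 2: one bucket per value of that tag key
                            "tag_val": {
                                "terms": {
                                    "field": "tags_nested.tag_val",
                                    "size": max_buckets,
                                },
                                "aggs": {
                                    # step 3: jump back to the parent host doc
                                    "hosts": {
                                        "reverse_nested": {},
                                        "aggs": {
                                            # step 4: one bucket per host id
                                            "node_id": {
                                                "terms": {
                                                    "field": "id",
                                                    "size": max_buckets,
                                                }
                                            }
                                        },
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


def host_ids_by_tag_value(response):
    """Flatten the nested aggregation response into {tag value: [host ids]}."""
    buckets = response["aggregations"]["tags"]["tag_key"]["tag_val"]["buckets"]
    return {
        bucket["key"]: [host["key"] for host in bucket["hosts"]["node_id"]["buckets"]]
        for bucket in buckets
    }


# Hand-written example of the response shape (assumption, not real data).
SAMPLE_RESPONSE = {
    "aggregations": {
        "tags": {
            "doc_count": 3,
            "tag_key": {
                "doc_count": 3,
                "tag_val": {
                    "buckets": [
                        {
                            "key": "east-1a",
                            "doc_count": 2,
                            "hosts": {
                                "doc_count": 2,
                                "node_id": {
                                    "buckets": [
                                        {"key": "host-1", "doc_count": 1},
                                        {"key": "host-2", "doc_count": 1},
                                    ]
                                },
                            },
                        },
                        {
                            "key": "east-1b",
                            "doc_count": 1,
                            "hosts": {
                                "doc_count": 1,
                                "node_id": {"buckets": [{"key": "host-3", "doc_count": 1}]},
                            },
                        },
                    ]
                },
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(build_group_by_tag_query("zone"), indent=2))
    print(host_ids_by_tag_value(SAMPLE_RESPONSE))
```

The dict returned by `build_group_by_tag_query("zone")` is meant to be equivalent to the JSON query above (key order aside), so it can be passed as the search body with whatever client you use.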

This query gives us what we want, but we know the dataset will grow much larger, and we don't expect this query to keep performing well at that scale. What would be a better way of doing this?
