How to use the bucket selector aggregation

I want to get a list of processes that utilize more than 1% CPU. So I created a nested bucket aggregation query which ends in an avg aggregation:

GET metricbeat-*/_search
{
  "aggs": {
    "host": {
      "terms": {
        "field": "agent.hostname"
      },
      "aggs": {
        "user": {
          "terms": {
            "field": "user.name"
          },
          "aggs": {
            "process": {
              "terms": {
                "field": "process.name"
              },
              "aggs": {
                "pid": {
                  "terms": {
                    "field": "process.pid"
                  },
                  "aggs": {
                    "cpu": {
                      "avg": {
                        "field": "system.process.cpu.total.pct"
                      }
                    },
                    "cpu_bucket_selector": {
                      "bucket_selector": {
                        "buckets_path": {
                          "avg_cpu": "cpu"
                        },
                        "script": "params.avg_cpu > 0.01"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m/s"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

Values below 1% are filtered out, but this also leaves empty "pid" bucket lists in the results. I tried defining the bucket selector at the top level, next to the first parent aggregation, like this:

GET metricbeat-*/_search
{
  "aggs": {
    "host": {
      "terms": {
        "field": "agent.hostname"
      },
      "aggs": {
        "user": {
          "terms": {
            "field": "user.name"
          },
          "aggs": {
            "process": {
              "terms": {
                "field": "process.name"
              },
              "aggs": {
                "pid": {
                  "terms": {
                    "field": "process.pid"
                  },
                  "aggs": {
                    "cpu": {
                      "avg": {
                        "field": "system.process.cpu.total.pct"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "cpu_bucket_selector": {
      "bucket_selector": {
        "buckets_path": {
          "avg_cpu": "host>user>process>pid>cpu"
        },
        "script": "params.avg_cpu > 0.01"
      }
    }
  },
  "size": 0,
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m/s"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

But this gives me an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: bucket_selector aggregation [cpu_bucket_selector] must be declared inside of another aggregation;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: bucket_selector aggregation [cpu_bucket_selector] must be declared inside of another aggregation;"
  },
  "status" : 400
}

What's the issue, and am I even on the right path?

Just to make sure I understand the request: you are searching for any data where the pct is greater than 0.01? If so, why don't you put this condition into the query part of the search request instead of using an aggregation for it? That would be much more efficient, as far fewer documents would have to be collected and aggregated. You could still run an aggregation on the process name/pid on top of that data.
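
A minimal sketch of that suggestion, reusing the index pattern and field names from the question. One assumption to be aware of: the range filter applies to each individual document's system.process.cpu.total.pct, so it keeps single samples above the threshold rather than processes whose five-minute average exceeds it.

```console
# Sketch only: filters individual samples, not the 5-minute average per process.
GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-5m/s" } } },
        { "range": { "system.process.cpu.total.pct": { "gt": 0.01 } } }
      ]
    }
  },
  "aggs": {
    "process": {
      "terms": { "field": "process.name" },
      "aggs": {
        "pid": {
          "terms": { "field": "process.pid" }
        }
      }
    }
  }
}
```

The "host" and "user" levels from the original request could be wrapped around "process" in the same way.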

Hope that helps.

The value 0.01 is an example and meant to be variable. The query is: "give me all processes that used more than x percent of CPU during the last 5 minutes".

Anyway, efficiency is not my problem. I want to filter out whole branches depending on the value of the avg cpu aggregation. The bucket selector in the first example works fine, except that it sometimes returns branches without leaves. So I tried to move the bucket selector further up the aggregation tree, but I can't figure out the right parameters.

I misread your requirement; now it is clearer, thanks for explaining.

Is it possible that some buckets don't have any documents, and thus you end up with empty buckets? You could try the min_doc_count parameter on some of the aggs, but that is just a guess without seeing your data.

Parent pipeline aggs such as bucket_selector cannot be placed at the root level of aggs. This is what the exception message is telling you when it says the aggregation must be declared inside of another aggregation.
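
One way to act on this, sketched under assumptions and not tested against this data: keep the original selector inside "pid", and add a second bucket_selector inside the parent terms aggregation that reads the child's _bucket_count path, so a parent bucket is dropped once all of its child buckets have been removed. The selector name non_empty_pid is made up for illustration, and only the two innermost levels are shown.

```console
# Sketch only: "non_empty_pid" is a hypothetical name; untested against this data.
GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-5m/s" } }
  },
  "aggs": {
    "process": {
      "terms": { "field": "process.name" },
      "aggs": {
        "pid": {
          "terms": { "field": "process.pid" },
          "aggs": {
            "cpu": {
              "avg": { "field": "system.process.cpu.total.pct" }
            },
            "cpu_bucket_selector": {
              "bucket_selector": {
                "buckets_path": { "avg_cpu": "cpu" },
                "script": "params.avg_cpu > 0.01"
              }
            }
          }
        },
        "non_empty_pid": {
          "bucket_selector": {
            "buckets_path": { "pid_count": "pid._bucket_count" },
            "script": "params.pid_count > 0"
          }
        }
      }
    }
  }
}
```

The same pattern would repeat one level up ("process._bucket_count" inside "user", "user._bucket_count" inside "host"). Keep in mind that a terms aggregation returns only its top 10 buckets by default and the selectors prune afterwards, so the "size" of each terms aggregation may need to be raised.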

The buckets always contain data; they only become empty once I filter their contents out with the bucket selector.