Trying to use sum_bucket agg to summarize the last value per server into a total

I'm trying to create a search (ultimately a visualization) that will give me the total number of Controller nodes in a cluster. Each node reports metrics every 15 seconds, and each document will contain the value for that specific node, with the active controller carrying "1" as the value, and all others "0".

My goal is to get the latest value for each node, and then sum those values. If the result is anything other than 1, that's bad. Obviously, if I get values from more documents than just the last one per node, the result will be wrong.
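To make the intent concrete, here is a rough Python analogy of the computation I'm after (not the actual Elasticsearch query; the hosts, timestamps, and values below are made up):

```python
# Toy documents: one entry per (host, timestamp) report; "active" mirrors
# prometheus.metrics.kafka_controller_active_controller_count.
docs = [
    {"host": "node1", "ts": 10, "active": 0},
    {"host": "node1", "ts": 25, "active": 0},
    {"host": "node2", "ts": 12, "active": 1},
    {"host": "node2", "ts": 27, "active": 1},  # node2 is the current controller
    {"host": "node3", "ts": 11, "active": 0},
    {"host": "node3", "ts": 26, "active": 0},
]

# Keep only the most recent document per host...
latest = {}
for doc in docs:
    if doc["host"] not in latest or doc["ts"] > latest[doc["host"]]["ts"]:
        latest[doc["host"]] = doc

# ...then sum the metric across those latest documents.
total = sum(d["active"] for d in latest.values())
print(total)  # a healthy cluster yields exactly 1
```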

This is my current query:

GET metricbeat-8.6.2/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "exists": {
            "field": "prometheus.metrics.kafka_controller_active_controller_count"
          }
        }
      ],
      "must": [
        {
          "terms": {
            "host.name": [
              "node1",
              "node2",
              "node3",
              "node4"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "Per_Broker": {
      "terms": {
        "field": "host.name",
        "size": 4
      },
      "aggs": {
        "Last_Value": {
          "terms": {
            "field": "@timestamp",
            "size": 1,
            "order": {
              "_key": "desc"
            }
          },
          "aggs": {
            "Active_Controller": {
              "sum": {
                "field": "prometheus.metrics.kafka_controller_active_controller_count"
              }
            }
          }
        }
      }
    },
    "Number_of_Active_Controllers": {
      "sum_bucket": {
        "buckets_path": "Per_Broker['Last_Value'].Active_Controller"
      }
    }
  },
  "_source": false,
  "fields": [
    "host.name",
    "prometheus.metrics.kafka_controller_active_controller_count"
  ],
  "size": 0
}

The result:

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 19,
    "successful": 19,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4043,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "Per_Broker": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "node1",
          "doc_count": 1081,
          "Last_Value": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 1080,
            "buckets": [
              {
                "key": 1698237171688,
                "key_as_string": "2023-10-25T12:32:51.688Z",
                "doc_count": 1,
                "Active_Controller": {
                  "value": 0
                }
              }
            ]
          }
        },
        {
          "key": "node2",
          "doc_count": 1026,
          "Last_Value": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 1025,
            "buckets": [
              {
                "key": 1698237168398,
                "key_as_string": "2023-10-25T12:32:48.398Z",
                "doc_count": 1,
                "Active_Controller": {
                  "value": 1
                }
              }
            ]
          }
        },
        {
          "key": "node3",
          "doc_count": 992,
          "Last_Value": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 991,
            "buckets": [
              {
                "key": 1698237165024,
                "key_as_string": "2023-10-25T12:32:45.024Z",
                "doc_count": 1,
                "Active_Controller": {
                  "value": 0
                }
              }
            ]
          }
        },
        {
          "key": "node4",
          "doc_count": 944,
          "Last_Value": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 943,
            "buckets": [
              {
                "key": 1698237161746,
                "key_as_string": "2023-10-25T12:32:41.746Z",
                "doc_count": 1,
                "Active_Controller": {
                  "value": 0
                }
              }
            ]
          }
        }
      ]
    },
    "Number_of_Active_Controllers": {
      "value": 0
    }
  }
}

No matter how many times I run it, the Number_of_Active_Controllers bucket has the value 0, even though at least one Active_Controller bucket has the value 1 every single time. I would expect this to be 1, since 0 + 1 + 0 + 0 = 1.

Do I have a problem in my bucket path? This was the only way I could get both a valid path at all, and a metric that was a number, not an object.

Bonus question: how would I translate this into a Kibana Metric or Gauge visualization? I've tried every which way, but I can't find any way to get specifically the last values per node, the way I can with the above query.

Yeah, I don't think your bucket path is correct. The key is not Last_Value; it would be an actual value... which in this case is a @timestamp.

I don't have test data to try this on... but you want the first bucket. Did you try:

"buckets_path": "Per_Broker[0].Active_Controller"

it may even be this

"buckets_path": "Per_Broker>Last_Value[0].Active_Controller"

Or this

"buckets_path": "Per_Broker>Last_Value.0.Active_Controller"

Because it is Last_Value that is a multi-bucket, and you want the first one, even though you only have one...

"buckets_path": "Per_Broker>Last_Value[0].Active_Controller" logically works, but doesn't give the expected result either.

"buckets_path": "Per_Broker>Last_Value.0.Active_Controller" doesn't work at all, it gives the following error: Validation Failed: 1: No aggregation [Last_Value.0] found for path [Per_Broker>Last_Value.0.Active_Controller];

"buckets_path": "Per_Broker[0].Active_Controller" also somehow logically works, but I can't really see why, because to me it skips Last_Value altogether. Either way, it's not the first broker I want: I want all brokers, with the Active_Controller value for each broker summed together.

The data, at least to reproduce this example, is very simple:

host.name is a keyword field, containing the hostname

@timestamp is, obviously, the timestamp of the event

prometheus.metrics.kafka_controller_active_controller_count (call it whatever you want, really) is just a simple integer value.

Each server, in my case four, produces a value every 15 seconds by way of Metricbeat collecting the data from a JMX plugin. That means every 15 seconds there is a snapshot of the state of the cluster, with (hopefully) just one of the results containing "1", and all the others "0".
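For illustration, a single document could be sketched as the following Python dict (the nesting is assumed from Metricbeat's dotted field names; in the index these are mapped fields, not a literal dict):

```python
# Hypothetical shape of one Metricbeat document (field names from the thread):
doc = {
    "@timestamp": "2023-10-25T12:32:48.398Z",
    "host": {"name": "node2"},  # keyword field host.name
    "prometheus": {
        "metrics": {
            # 1 on the active controller, 0 on every other node
            "kafka_controller_active_controller_count": 1
        }
    },
}
```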

Edit: Some other combinations I tried:

"buckets_path": "Per_Broker[Last_Value.0].Active_Controller" works, but the result in the sum_bucket is still 0...

"buckets_path": "Per_Broker[Last_Value[0]].Active_Controller" gives error: Validation Failed: 1: No aggregation found for path [Per_Broker[Last_Value[0]].Active_Controller];

"buckets_path": "Per_Broker>Last_Value[0]>Active_Controller" works, but also still 0

Edit 2: I guess I realize I don't understand the exact structure of a "bucket path" at all, or what makes one bucket different from another in a way that requires the path to be different. Either way, it's weird how so many variations of these paths "work": they give no errors and allow a result to emerge in the sum_bucket, but I have no idea why, because I feel like they shouldn't. Nor do I know what values they're actually fetching, since the result is never the one I expect.

From what I've read, terms is considered a multi-bucket aggregation. Since I use terms twice, that must mean I have a multi-bucket nested inside another multi-bucket. The last aggregation is a metric aggregation, not a bucket, which should sum up all the (size: 1) documents in the parent multi-bucket.

So, I guess the path would somehow have to be multi_bucket>multi_bucket.metric, but since I want the equivalent of for each in both multi-buckets, I don't know how to write that, or if that's even possible?

OK, I believe I may have solved it. Like I half suspected, and as was suggested in another forum thread I found, sum_bucket cannot reach that many levels deep. That person solved it with intermediate sum buckets per level, then summed those in the end. Brilliant? I reserve my judgment, but it does seem to genuinely work.

  "aggs": {
    "Per_Broker": {
      "terms": {
        "field": "host.name",
        "size": 4
      },
      "aggs": {
        "Last_Value": {
          "terms": {
            "field": "@timestamp",
            "size": 1,
            "order": {
              "_key": "desc"
            }
          },
          "aggs": {
            "Active_Controller": {
              "sum": {
                "field": "prometheus.metrics.kafka_controller_active_controller_count"
              }
            }
          }
        },
        "Is_Controller": {
          "sum_bucket": {
            "buckets_path": "Last_Value.Active_Controller"
          }
        }
      }
    },
    "Num_Controllers": {
      "sum_bucket": {
        "buckets_path": "Per_Broker.Is_Controller"
      }
    }
  }

I can confirm it works as I need it to by increasing the size on the @timestamp terms agg. By setting it to 2, the intermediate sum covers the last two values per broker, which is then carried into the final sum_bucket as well. That means even if two different servers were controllers simultaneously, rather than the same one at two points in time, the end result would also be 2 and therefore signal a problem.
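The two-level sum can be mirrored in plain Python (an analogy with made-up bucket contents, not Elasticsearch itself), including the size: 2 sanity check:

```python
# Analogy of the agg tree: Per_Broker is a list of broker buckets, each with a
# Last_Value list holding the Active_Controller values of its newest snapshots.
per_broker = [
    {"key": "node1", "Last_Value": [0]},
    {"key": "node2", "Last_Value": [1]},  # current controller
    {"key": "node3", "Last_Value": [0]},
    {"key": "node4", "Last_Value": [0]},
]

# Inner sum_bucket ("Is_Controller") collapses Last_Value per broker,
# then the outer sum_bucket ("Num_Controllers") sums across brokers.
is_controller = {b["key"]: sum(b["Last_Value"]) for b in per_broker}
num_controllers = sum(is_controller.values())
print(num_controllers)  # healthy: exactly 1

# With "size": 2 on the @timestamp terms agg, each broker contributes its last
# two snapshots, so one broker that was controller in both reports 2 -- just as
# two different brokers each reporting 1 would.
per_broker_size2 = [
    {"key": "node1", "Last_Value": [0, 0]},
    {"key": "node2", "Last_Value": [1, 1]},
    {"key": "node3", "Last_Value": [0, 0]},
    {"key": "node4", "Last_Value": [0, 0]},
]
num_size2 = sum(sum(b["Last_Value"]) for b in per_broker_size2)
print(num_size2)  # 2 -> signals a problem
```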

Now to figure out how the hell I turn this into a Kibana visualization...
