How do I consume Heartbeat data for meaningful alerting?

I have created a Watcher that consumes the data from a named ICMP Heartbeat monitor [1] running as a Kubernetes DaemonSet [2]. I have not yet gotten approval to post the configuration of the Watcher, so I will provide a high-level description of how I get the data.

The Watcher uses the following fields, which are documented here [3] and here [4]:

  • agent.hostname
  • url.domain
  • monitor.type

To get to my desired data:

  • Documents are filtered using the "icmp" monitor type and a provided monitor name
  • Documents are separated into unique buckets using agent.hostname (source) and url.domain (dest) [5]
  • The number of unique buckets is capped at a configured max bucket count of 10k (this may be the maximum)
  • A max aggregation [6] is performed on each of those composite buckets
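
The steps above can be sketched as a search request body. This is a minimal Python sketch, not the actual Watcher configuration: the function name, the aggregation names ("pairs", "max_value"), the `max_field` parameter, and the use of `monitor.name` and `@timestamp` for filtering are assumptions made for illustration. The composite `size` parameter sets how many buckets come back per page, and `after` resumes from a previous page's after_key.

```python
import json


def build_search_body(monitor_name, max_field, page_size=1000,
                      after_key=None, lookback="5m"):
    """Build a composite-aggregation request body (illustrative only)."""
    composite = {
        "size": page_size,  # buckets returned per page
        "sources": [
            {"source": {"terms": {"field": "agent.hostname"}}},
            {"dest": {"terms": {"field": "url.domain"}}},
        ],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key

    return {
        "size": 0,  # only aggregations are needed, not hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"monitor.type": "icmp"}},
                    # Assumption: the monitor is selected by monitor.name.
                    {"term": {"monitor.name": monitor_name}},
                    {"range": {"@timestamp": {"gte": f"now-{lookback}"}}},
                ]
            }
        },
        "aggs": {
            "pairs": {
                "composite": composite,
                # One max sub-aggregation per (source, dest) bucket.
                "aggs": {"max_value": {"max": {"field": max_field}}},
            }
        },
    }


if __name__ == "__main__":
    # "my-monitor" and "icmp.rtt.us" are placeholder values.
    print(json.dumps(build_search_body("my-monitor", "icmp.rtt.us"), indent=2))
```

The field passed as `max_field` has to be numeric for a max aggregation; which field the real Watcher aggregates on is not stated here, so it is left as a parameter.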

Heartbeat is configured to create a ping every 5 seconds; both the Input [6] and Trigger [7] are configured to look back 5m, so I expect 60 documents per pair over that duration.
On my largest Kubernetes cluster, there are 23 nodes, so I expect 60 * 23^2 (~31.7k) documents across 23^2 * 2 (~1k) buckets (23^2 for composite, x2 because each composite gets a max agg). Even though my max buckets is configured at 10k buckets, I have noticed that the response includes an after_key [8] and the output is truncated, dropping all pairs after around 16 (when iterating a-a -> w-w).
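
As a sanity check on these estimates, the arithmetic can be written out. This is a small sketch using only the figures stated above (5-second interval, 5-minute lookback, 23 nodes); the variable names are illustrative.

```python
# Figures taken from the description above.
ping_interval_s = 5
lookback_s = 5 * 60
nodes = 23

docs_per_pair = lookback_s // ping_interval_s   # 60 pings per pair in 5m
pairs = nodes ** 2                              # every node pings every node
expected_docs = docs_per_pair * pairs           # documents in the window
expected_buckets = pairs * 2                    # composite bucket + max agg each

print(docs_per_pair, pairs, expected_docs, expected_buckets)
# → 60 529 31740 1058
```

Both totals are well under a 10k bucket limit, which is why the truncation is unexpected.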

I would like to know what I should be doing to return the entire output for consumption by my Watcher. I would also like to know how to address the cases where I scale from 2 to n pages of output based on the number of buckets I'm returning. I believe the Chain Input [9] solves the problem with a known number of pages, but I am not sure if this is the best and/or most scalable approach.
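
For reference, this is how after_key pagination looks when driven from a client rather than from a Watcher. It is a minimal sketch under stated assumptions: `search` is a hypothetical callable that runs one composite-aggregation request (passing the given key as `after`) and returns the parsed response, and "pairs" is an assumed aggregation name. It does not show how to do the same inside a Watcher, where the number of chained inputs is fixed up front.

```python
def collect_all_buckets(search, agg_name="pairs"):
    """Follow after_key until the composite aggregation is exhausted.

    `search` is a hypothetical callable: search(after_key) -> response dict.
    """
    buckets = []
    after_key = None
    while True:
        response = search(after_key)
        agg = response["aggregations"][agg_name]
        buckets.extend(agg["buckets"])
        after_key = agg.get("after_key")
        # Stop when the page is empty or no key is returned for a next page.
        if after_key is None or not agg["buckets"]:
            break
    return buckets
```

The loop handles any number of pages, which is the part a fixed chain of inputs cannot express directly.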

As an aside, if/when I am able to share this Watcher code, where would be the best public repo to keep it?


