Alert on repeated ping failures

I'd like to set up:

  1. a dashboard showing the ping status of a number of systems (perhaps a table of system names and their status)
  2. alerts indicating when systems have stopped responding to the last X ping attempts.

I currently have Heartbeat pinging a number of systems and am getting an index with @timestamp, but it's not clear to me how to set up the alerts or dashboard from there. Any help would be appreciated.

You can build your own dashboards in Kibana. I would start with the Time Series Visual Builder if I were creating a new visualization.

This is one that I use with Heartbeat.

Then for alerting I use Watcher, which is part of X-Pack. The following request creates a watch that queries the data every 60s, looks for hosts that were down in the last minute, and sends me a Slack notification.

PUT _xpack/watcher/watch/heartbeat-monitor-status-down
{
    "trigger": {
      "schedule": {
        "interval": "1m"
      }
    },
    "input": {
      "search": {
        "request": {
          "search_type": "query_then_fetch",
          "indices": [
            "heartbeat-*"
          ],
          "types": [],
          "body": {
            "size": 0,
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "monitor.status": {
                        "value": "down"
                      }
                    }
                  }
                ],
                "filter": [
                  {
                    "range": {
                      "@timestamp": {
                        "gte": "now-1m"
                      }
                    }
                  }
                ]
              }
            },
            "aggregations": {
              "by_monitors": {
                "terms": {
                  "field": "monitor.id",
                  "size": 10,
                  "min_doc_count": 1
                }
              }
            }
          }
        }
      }
    },
    "condition": {
      "compare": {
        "ctx.payload.hits.total": {
          "gt": 0
        }
      }
    },
    "actions": {
      "notify-slack": {
        "throttle_period_in_millis": 900000,
        "slack": {
          "account": "monitoring",
          "message": {
            "from": "Heartbeat",
            "text": "Some hosts are unresponsive.",
            "dynamic_attachments": {
              "list_path": "ctx.payload.aggregations.by_monitors.buckets",
              "attachment_template": {
                "color": "warning",
                "title": "{{key}}",
                "text": "Total events: {{doc_count}}"
              }
            }
          }
        }
      }
    }
}
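
Note that the watch above fires on any "down" event in the last minute, while the original question asked about the last X ping attempts. Here is a minimal sketch of that check in Python, working on plain dictionaries rather than on a live cluster. The flattened keys (`monitor_id`, `timestamp`, `status`) and the function name are illustrative assumptions, not Heartbeat's actual document layout, and timestamps are assumed to be ISO-8601 UTC strings so that they sort correctly as text.

```python
from collections import defaultdict


def monitors_down_for_last_n(events, n):
    """Return sorted IDs of monitors whose n most recent pings were all 'down'.

    `events` is a list of dicts with the hypothetical keys
    'monitor_id', 'timestamp' (ISO-8601 UTC string) and 'status'.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Group the ping events by monitor.
    by_monitor = defaultdict(list)
    for event in events:
        by_monitor[event["monitor_id"]].append(event)

    failing = []
    for monitor_id, pings in by_monitor.items():
        # Oldest first, so the last n entries are the most recent pings.
        pings.sort(key=lambda e: e["timestamp"])
        recent = pings[-n:]
        # Require a full window so a monitor with fewer than n pings is not flagged.
        if len(recent) == n and all(e["status"] == "down" for e in recent):
            failing.append(monitor_id)
    return sorted(failing)


if __name__ == "__main__":
    sample = [
        {"monitor_id": "host-a", "timestamp": "2018-01-01T00:00:00Z", "status": "up"},
        {"monitor_id": "host-a", "timestamp": "2018-01-01T00:01:00Z", "status": "down"},
        {"monitor_id": "host-a", "timestamp": "2018-01-01T00:02:00Z", "status": "down"},
        {"monitor_id": "host-b", "timestamp": "2018-01-01T00:01:00Z", "status": "down"},
        {"monitor_id": "host-b", "timestamp": "2018-01-01T00:02:00Z", "status": "up"},
    ]
    print(monitors_down_for_last_n(sample, 2))
```

Requiring a full window of n pings is a design choice: it avoids alerting on a monitor that has only just been added, at the cost of a short delay before a new monitor can trigger an alert.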

@andrewkroh, thanks for your reply. I'm not sure how to re-create your dashboard, though:

For the ICMP RTT Times visualization in the Time Series Visual Builder, what are you specifying for the aggregation and grouping?

For the up and down count, I guess these are Data > Metric visualizations, but what do you specify for the metrics/buckets?

For the alert, where does that definition go? Can I enter that in through the Kibana interface?

I guess under New Watch I should choose Advanced Watch vs. Threshold Alert to be able to enter the JSON definition.

I built mine before TSVB existed, so I used a Line Chart, but conceptually they are the same. It's a max metric aggregation on the icmp.rtt.us field, and each line represents a single monitor.ip, so group by that field.
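
For anyone who wants to see what that chart boils down to as a query, here is a rough sketch of an equivalent search body, built as a Python dictionary. It is an illustration of the aggregation described above, not the request Kibana actually generates. The function name, the aggregation names (`per_ip`, `over_time`, `max_rtt`) and the defaults are assumptions of mine. The field names come from this thread.

```python
def rtt_line_chart_body(interval="30m", max_series=10):
    """Build a search body: max icmp.rtt.us over time, one series per monitor.ip."""
    return {
        # Only the aggregation results are needed, not the matching documents.
        "size": 0,
        "aggs": {
            "per_ip": {
                # One bucket (one line on the chart) per monitored IP.
                "terms": {"field": "monitor.ip", "size": max_series},
                "aggs": {
                    "over_time": {
                        # X-axis: time buckets. On Elasticsearch 7.2 and later
                        # this parameter is named "fixed_interval".
                        "date_histogram": {
                            "field": "@timestamp",
                            "interval": interval,
                        },
                        "aggs": {
                            # Y-axis: worst round-trip time in each bucket.
                            "max_rtt": {"max": {"field": "icmp.rtt.us"}}
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(rtt_line_chart_body(), indent=2))
```

The printed JSON can be pasted into the Kibana console as the body of a search against `heartbeat-*` to check that the fields exist and return data before building the visualization.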

The up/down metrics are unique counts of the monitor.id with a query of either monitor.status:up or monitor.status:down.
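
As a sketch of what such a metric computes, the search body below counts distinct monitor.id values for one status, again built as a Python dictionary. The function name, the `unique_monitors` aggregation name and the optional `last_minutes` parameter are my own assumptions. In Kibana the time picker normally supplies the time range, so the explicit range filter here only stands in for it.

```python
def status_count_body(status, last_minutes=None):
    """Build a search body counting unique monitor.id values with the given status."""
    if status not in ("up", "down"):
        raise ValueError("status must be 'up' or 'down'")

    # Equivalent of typing monitor.status:up (or :down) into the query bar.
    filters = [{"term": {"monitor.status": status}}]
    if last_minutes is not None:
        # Optional stand-in for the Kibana time picker.
        filters.append(
            {"range": {"@timestamp": {"gte": "now-{}m".format(last_minutes)}}}
        )

    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {
            # Unique count of monitors (approximate for very large cardinalities).
            "unique_monitors": {"cardinality": {"field": "monitor.id"}}
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(status_count_body("down", last_minutes=5), indent=2))
```

Keeping the window short matters for this metric: over a long time range a monitor that flapped would be counted under both "up" and "down".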

Ok, thanks.

I haven't managed to reproduce the line chart. On the Metrics Y-axis I have max icmp.rtt.us. In the Buckets section I have X-Axis date histogram by @timestamp, 30m interval, and Split Series terms by monitor.host. I see some dots/circles but no lines.

For the up/down metrics I managed to get something approximately like what you have by using a metric of unique count of monitor.id with buckets split by terms on monitor.status. But otherwise I didn't see how or where to specify a query of either monitor.status:up or monitor.status:down ...

monitor.status: up would go into the text box that says "Search...". This will filter things such that the aggregation only includes those that are up.

Thanks Andrew, your screenshot helped me reproduce the line chart.

For the up/down metric, I didn't see any text box that says "Search..." (there's "JSON Input", "Exclude", "Include") but I found under Buckets > Split Group > Aggregation > Filters that I can filter for "monitor.status:down".

But it looks like for the up/down metric to be useful, it would also need to count only ping records from the last X minutes, and it's not clear how to do that at the same time as having that filter.

It’s shown in the image you posted. It’s near the top. Right under “Visualize”.

Ah, thanks Andrew! I didn't realize that that bar was part of the visualization definition ... I was only looking under Data and Options under the index pattern.
