Alert on repeated ping failures


#1

I'd like to set up:

  1. a dashboard showing the ping status of a number of systems (perhaps a table of system names and their status)
  2. alerts indicating when systems have stopped responding to the last X ping attempts.

I currently have Heartbeat pinging a number of systems and am getting an index with @timestamp, but it's not clear to me how to set up the alerts or dashboard from there. Any help would be appreciated.


(Andrew Kroh) #2

You can build your own dashboards in Kibana. I would start with the Time Series Visual Builder if I were creating a new visualization.

This is one that I use with Heartbeat.

Then for alerting I use Watcher, which is part of X-Pack. The following creates a watch that queries the data every 60s, looks for hosts that were down in the last minute, and sends me a Slack notification.

PUT _xpack/watcher/watch/heartbeat-monitor-status-down
{
    "trigger": {
      "schedule": {
        "interval": "1m"
      }
    },
    "input": {
      "search": {
        "request": {
          "search_type": "query_then_fetch",
          "indices": [
            "heartbeat-*"
          ],
          "types": [],
          "body": {
            "size": 0,
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "monitor.status": {
                        "value": "down"
                      }
                    }
                  }
                ],
                "filter": [
                  {
                    "range": {
                      "@timestamp": {
                        "from": "now-1m"
                      }
                    }
                  }
                ]
              }
            },
            "aggregations": {
              "by_monitors": {
                "terms": {
                  "field": "monitor.id",
                  "size": 10,
                  "min_doc_count": 1
                }
              }
            }
          }
        }
      }
    },
    "condition": {
      "compare": {
        "ctx.payload.hits.total": {
          "gt": 0
        }
      }
    },
    "actions": {
      "notify-slack": {
        "throttle_period_in_millis": 900000,
        "slack": {
          "account": "monitoring",
          "message": {
            "from": "Heartbeat",
            "text": "Some hosts are unresponsive.",
            "dynamic_attachments": {
              "list_path": "ctx.payload.aggregations.by_monitors.buckets",
              "attachment_template": {
                "color": "warning",
                "title": "{{key}}",
                "text": "Total events: {{doc_count}}"
              }
            }
          }
        }
      }
    }
}
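Note that the watch above fires whenever any "down" event was recorded in the last minute, while the original question asked about systems that failed their last X ping attempts. As a rough illustration of that stricter rule, here is a minimal Python sketch that works on events already fetched from the heartbeat-* index. The function name and the tuple layout are assumptions made for this example; they are not part of Heartbeat or Watcher.

```python
from collections import defaultdict


def monitors_failing_last_n(events, n):
    """Return sorted ids of monitors whose n most recent checks were all "down".

    events: iterable of (monitor_id, timestamp, status) tuples (hypothetical
    layout). Timestamps only need to be sortable, e.g. ISO 8601 strings.
    """
    by_monitor = defaultdict(list)
    for monitor_id, timestamp, status in events:
        by_monitor[monitor_id].append((timestamp, status))

    failing = []
    for monitor_id, checks in by_monitor.items():
        checks.sort(key=lambda check: check[0])  # oldest first
        recent = checks[-n:]
        # Require a full window of n checks, all of them "down".
        if len(recent) == n and all(status == "down" for _, status in recent):
            failing.append(monitor_id)
    return sorted(failing)
```

A monitor with fewer than n recorded checks is never reported, which avoids alerting on a host that has only just been added.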

#3

@andrewkroh, thanks for your reply. I'm not sure how to re-create your dashboard, though:

For the ICMP RTT Times time series visual builder what are you specifying for the aggregation and grouping?

For the up and down count, I guess these are Data > Metric visualizations, but what do you specify for the metrics/buckets?

For the alert, where does that definition go? Can I enter that in through the Kibana interface?


#4

I guess under New Watch I should choose Advanced Watch vs. Threshold Alert to be able to enter the JSON definition.
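As an alternative to pasting the JSON into the Kibana UI, the definition could also be sent to the `PUT _xpack/watcher/watch/<id>` endpoint from post #2 by a script. The sketch below only builds the HTTP request using the Python standard library. The helper name and the `http://localhost:9200` base URL are assumptions, and a secured cluster would also need credentials.

```python
import json
import urllib.request


def build_put_watch_request(watch_id, watch_body, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that stores a watch definition.

    base_url is an assumed local Elasticsearch address; adjust as needed.
    """
    url = f"{base_url}/_xpack/watcher/watch/{watch_id}"
    data = json.dumps(watch_body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually submit it.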


(Andrew Kroh) #5

I built mine before TSVB existed, so I used a Line Chart, but conceptually they are the same. It's a max metric aggregation on icmp.rtt.us, and each line represents a single monitor.ip, so group by that field.
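For reference, the aggregation described here corresponds roughly to the following Elasticsearch search body, written as a Python dict. This is a sketch, not the request Kibana itself generates. The function name, the aggregation names and the default values are assumptions. The `interval` key matches older releases; newer ones (7.2 and later) use `fixed_interval` or `calendar_interval` instead.

```python
def rtt_line_chart_body(interval="30m", max_series=10):
    """Search body: max icmp.rtt.us over time, one series per monitor.ip."""
    return {
        "size": 0,  # only aggregations are needed, no individual hits
        "aggs": {
            "per_ip": {
                # One bucket (one line on the chart) per monitored IP.
                "terms": {"field": "monitor.ip", "size": max_series},
                "aggs": {
                    "over_time": {
                        # X-axis: time buckets on @timestamp.
                        "date_histogram": {
                            "field": "@timestamp",
                            "interval": interval,
                        },
                        # Y-axis: maximum round-trip time in each bucket.
                        "aggs": {"max_rtt": {"max": {"field": "icmp.rtt.us"}}},
                    }
                },
            }
        },
    }
```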

The up/down metrics are unique counts of the monitor.id with a query of either monitor.status:up or monitor.status:down.
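Expressed as a single search request, the two counts could look something like the body below: a `filters` aggregation with one bucket per status and a `cardinality` (unique count) sub-aggregation on monitor.id. This is only a sketch; the function and aggregation names are made up for illustration, and in Kibana the same result comes from two Metric visualizations with different search queries.

```python
def up_down_counts_body():
    """Search body: unique monitor.id count for status up and for status down."""
    return {
        "size": 0,
        "aggs": {
            "by_status": {
                "filters": {
                    "filters": {
                        "up": {"term": {"monitor.status": "up"}},
                        "down": {"term": {"monitor.status": "down"}},
                    }
                },
                "aggs": {
                    # Unique count of monitors within each status bucket.
                    "unique_monitors": {"cardinality": {"field": "monitor.id"}}
                },
            }
        },
    }
```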


#6

Ok, thanks.

I haven't managed to reproduce the line chart. On the Metrics Y-axis I have max icmp.rtt.us. In the Buckets section I have X-Axis date histogram by @timestamp, 30m interval, and Split Series terms by monitor.host. I see some dots/circles but no lines.

For the up/down metrics I managed to get something approximately like what you have by using a metric of unique count of monitor.id with buckets split group by terms on monitor.status. But otherwise I didn't see how/where to specify a query of either monitor.status:up or monitor.status:down ...


(Andrew Kroh) #7


(Andrew Kroh) #8

monitor.status:up would go into the text box that says "Search...". This filters the data so that the aggregation only includes monitors that are up.


#9

Thanks Andrew, your screenshot helped me reproduce the line chart.

For the up/down metric, I didn't see any text box that says "Search..." (there's "JSON Input", "Exclude", "Include") but I found under Buckets > Split Group > Aggregation > Filters that I can filter for "monitor.status:down".

But it looks like for the up/down metric to be useful I would also need it to count only ping records from the last X minutes, and it's not clear how to do that at the same time as having that filter.
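At the query level, a status filter and a time window can be combined in one `bool` query, much like the watch in post #2 does. The sketch below shows that shape as a Python dict. The function name and parameters are assumptions for illustration; inside Kibana the time range is normally controlled by the time picker rather than written by hand.

```python
def status_in_window_query(status, minutes):
    """Query matching events with the given monitor.status in the last N minutes."""
    return {
        "bool": {
            "filter": [
                {"term": {"monitor.status": status}},
                # Date math: "now-5m" means five minutes before the query runs.
                {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
            ]
        }
    }
```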


(Andrew Kroh) #10

It’s shown in the image you posted. It’s near the top. Right under “Visualize”.


#11

Ah, thanks Andrew! I didn't realize that that bar was part of the visualization definition ... I was only looking under Data and Options under the index pattern.

