Watcher Heartbeat monitor.status query help

Hi,

I've got a Watch set up with a Heartbeat search input that sends an alert when any system has failed to respond to pings (i.e. has monitor.status == down) in the last X minutes.

Here is my watch at the moment:

{
  "trigger": {
    "schedule": {
      "interval": "15m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "heartbeat-*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "term": {
                    "monitor.status": {
                      "value": "down"
                    }
                  }
                }
              ],
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "from": "now-15m"
                    }
                  }
                }
              ]
            }
          },
          "aggregations": {
            "by_monitors": {
              "terms": {
                "field": "monitor.host",
                "size": 100,
                "min_doc_count": 1
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 0
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "###@###.##"
        ],
        "subject": "Unresponsive test systems",
        "body": {
          "html": "{{ctx.payload.hits.total}} system(s) not responding to pings:<P>{{#ctx.payload.aggregations.by_monitors.buckets}}{{key}}<BR>{{/ctx.payload.aggregations.by_monitors.buckets}}"
        }
      }
    }
  }
}

I'd like to change this so that it only triggers when a given system has monitor.status:down currently AND had monitor.status:up just before that (e.g. somewhere in the window starting at now-30m).

How would that be done? Can a transform be used to search a second time, using the monitor.host or monitor.id values returned by the above query, to find the ones that had monitor.status:up earlier? Any examples or suggestions would be much appreciated.
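
One possible shape for the "down now, up just before" check, as an untested sketch: instead of a second search in a transform, a single search over the last 30 minutes could bucket by monitor.host, count "down" documents in the most recent 15 minutes and "up" documents in the 15 minutes before that, and keep only the hosts where both counts are non-zero. The aggregation names (down_now, up_before, went_down) and the 15m/30m windows are assumptions for illustration; they are not from the thread.

```console
# Sketch only: run in the Dev Tools console to inspect the buckets before
# moving the body into the watch's search input.
POST heartbeat-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-30m" } }
  },
  "aggs": {
    "by_monitors": {
      "terms": { "field": "monitor.host", "size": 100 },
      "aggs": {
        "down_now": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "monitor.status": "down" } },
                { "range": { "@timestamp": { "gte": "now-15m" } } }
              ]
            }
          }
        },
        "up_before": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "monitor.status": "up" } },
                { "range": { "@timestamp": { "gte": "now-30m", "lt": "now-15m" } } }
              ]
            }
          }
        },
        "went_down": {
          "bucket_selector": {
            "buckets_path": {
              "down": "down_now>_count",
              "up": "up_before>_count"
            },
            "script": "params.down > 0 && params.up > 0"
          }
        }
      }
    }
  }
}
```

With this query, ctx.payload.hits.total counts every Heartbeat document in the 30-minute window, so the watch condition would have to look at the remaining buckets instead, for example with a script condition along the lines of ctx.payload.aggregations.by_monitors.buckets.size() > 0. A host that keeps flapping between up and down would still match on each run, and a host that went down late in one window may match on two consecutive runs.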

hey,

This would be pretty complicated to do in such a generic watch that monitors all your hosts. If you had one watch per host, you could simply call the Ack Watch API yourself to silence a watch after it fires.
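
As a rough illustration of the Ack Watch API call, assuming a 6.x cluster (the thread does not state the version) and a hypothetical watch id "my_watch":

```console
# "my_watch" is a placeholder id; "send_email" is the action id used in the watch above.
# On 7.x and later the path drops the _xpack prefix.

# Acknowledge every action of the watch:
PUT _xpack/watcher/watch/my_watch/_ack

# Or acknowledge a single action only:
PUT _xpack/watcher/watch/my_watch/_ack/send_email
```

An acknowledged action stays throttled until the watch condition next evaluates to false, at which point the acknowledgement is reset and the action can fire again.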

Hope this helps!

--Alex

@spinscale, thanks Alex. Right after firing an email and/or Slack notification, can the "actions" section of a watch be made to acknowledge the action to silence it until after the watch passes? Could you show me what that would look like?

ps. I'm currently completely stuck: I can't get back to editing my watches.

Is there a way I can edit or remove the problem watch so I can continue working?

Hey,

I think you may be running into this kibana issue: https://github.com/elastic/kibana/issues/18532

First, can you share the full watch and your elasticsearch.yml Slack configuration? You can get the watch by running the Get Watch API in the Dev Tools console.

Also, which versions of Elasticsearch and Kibana are you running?
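
For reference, a Get Watch request in the Dev Tools console would look roughly like this, assuming a 6.x cluster and a hypothetical watch id "my_watch":

```console
# "my_watch" is a placeholder for the id of the affected watch.
# On 7.x and later the path drops the _xpack prefix.
GET _xpack/watcher/watch/my_watch
```

The response contains the stored definition under the "watch" key, along with the watch status.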

In the meantime, it might be easier to edit the watch in the Dev Tools console, for example by adding a "to" parameter there, so that the watch UI works again.
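
If the watch only needs to be taken out of the way for now, a sketch of the relevant Dev Tools requests, again assuming a 6.x cluster and a hypothetical watch id "my_watch":

```console
# "my_watch" is a placeholder id. On 7.x and later the path drops the _xpack prefix.

# Stop the watch from executing without deleting it:
PUT _xpack/watcher/watch/my_watch/_deactivate

# Or remove it entirely:
DELETE _xpack/watcher/watch/my_watch
```

Editing works by sending the complete, modified watch body with a PUT to _xpack/watcher/watch/my_watch, which overwrites the stored definition.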

I'll try to get a fix in, once all the information is provided.

Thank you!

--Alex

Thanks Alex, replied to you on the other post.
