Create a simple frequency-based alert using Watcher

Gurus,

I am looking to create a frequency-based watch. Very simply: if the query returns any results at all in the last N seconds, send an alert. I have written something that works, but since I am new to writing watches, I feel like what I have may be overcomplicated. In this example N is 30 seconds, so I have the schedule interval set to 30s and a range filter on @timestamp: "gte": "now-30s".

My watch is pasted below. Two questions: first, is there a more efficient way to do this? Second, is it possible for me to miss query matches?

{
  "trigger": {
    "schedule": {
      "interval": "30s"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "logstash-*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "range": {
              "@timestamp": {
                "gte": "now-30s"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "level": "info",
        "text": "There are {{ctx.payload.hits.total}} documents in your index. Threshold is 1."
      }
    }
  }
}

Hey,

The watch itself looks fine. The discussion around missing query matches, however, is a completely different one. I have two things to add to that:

  1. Elasticsearch has something called a refresh interval, which defines how often data is made available for search (by default every second). This means that documents indexed in the last second of a window may not yet be searchable when the watch runs, but by the time of the next query they have already fallen out of the time window.
  2. The above, however, is only a minor problem. The bigger issue, in my opinion, is that the timestamp is usually set when the event is created within the application. What is not accounted for is the time this event needs to travel to Elasticsearch. Maybe you are sending data directly from Beats to Elasticsearch, but maybe you are sending it to a broker first, where it sits for a few seconds before it gets indexed. Your ingestion could also have a much bigger delay due to a DDoS attack or a network outage. Any of these will add a bigger delay than the one-second refresh above.

The question then is: are you fine with ignoring those issues, or do you want to query bigger time windows (accepting the likelihood of duplicate alerts), or add some fancier alerting mechanism?
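To illustrate the "bigger time window" option: you could keep the 30-second trigger but widen the range and shift it back to allow for ingest lag. A minimal sketch of the changed query body follows; the 10-second allowance and the 35-second window are assumptions you would tune to your own pipeline, and consecutive runs then overlap by roughly 5 seconds:

"query": {
  "range": {
    "@timestamp": {
      "gte": "now-45s",
      "lt": "now-10s"
    }
  }
}

Every document is then covered by at least one run, at the cost of alerts firing about 10 seconds later and the occasional duplicate.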

Hope this helps!

--Alex

Thank you, that helps. To summarize what you said: if the watch runs every 30 seconds but queries the last 30+N seconds' worth of data, we should not lose any hits, though we may occasionally get some duplicates.

This kind of watch is very commonly written in alerting systems so that we don't inundate the receiver of the alarm or notification. The mechanism is usually called a notification interval or alarm interval.

Here is the definition of notification_interval from another, now ancient, notification framework called Nagios:
notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this service is still down or unreachable.

I think you may be interested in the acknowledgement/throttling capabilities of Watcher in that context. Please see:

https://www.elastic.co/guide/en/elastic-stack-overview/6.3/actions.html#actions-ack-throttle
https://www.elastic.co/guide/en/elastic-stack-overview/6.3/how-watcher-works.html#watch-acknowledgment-throttling
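For example, adding a per-action throttle to your existing watch would look roughly like this; the 5m value is just an illustration, not a recommendation:

"actions": {
  "my-logging-action": {
    "throttle_period": "5m",
    "logging": {
      "level": "info",
      "text": "There are {{ctx.payload.hits.total}} documents in your index. Threshold is 1."
    }
  }
}

While the throttle period is active, the condition is still evaluated on every run, but the action is not executed again.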

Nice! throttle_period looks very interesting. Any thoughts on whether there is a better way to rewrite my watch using throttle_period? Here is essentially what I need to do:

If there are any new hits to my query in the last N seconds, do the appropriate action.

We are trying to use Watcher to alert via PagerDuty when high-severity logs come through.
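Roughly, here is the shape of the action block I have in mind, sketched from the docs with a hypothetical action name and throttle value (the PagerDuty account itself would be configured in elasticsearch.yml):

"actions": {
  "notify-pagerduty": {
    "throttle_period": "10m",
    "pagerduty": {
      "description": "High severity logs: {{ctx.payload.hits.total}} hits in the last window"
    }
  }
}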

Watches run stateless, so the definition of "new" is a tricky one. If you need state, you could always store the result count of a query in its own document using the index action, and compare that count on the next run of the watch, when running the same query with the same filters.
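A sketch of the storing half of that pattern, using a hypothetical watch-state index and document id (the action-level transform reduces the payload to just the count before indexing):

"actions": {
  "store-count": {
    "transform": {
      "script": "return ['hits': ctx.payload.hits.total]"
    },
    "index": {
      "index": "watch-state",
      "doc_type": "doc",
      "doc_id": "my-watch-last-run"
    }
  }
}

The comparison half would then read that document back, for example via a chain input with a second search against watch-state.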

Another alternative, if those documents are super rare, is to store information about whether you have already processed each document, but that is not feasible at higher volumes.

