Watcher / Alerting: Time issues with frequent watch


(Nick Erber) #1

I am looking for a way to have an important watch trigger frequently, about every 5 seconds.

The problem is that an event sometimes takes a while to be sent by a Beat, filtered by Logstash and indexed by Elasticsearch, so a schedule of 5s combined with a range query like ...

 "@timestamp": {
     "from": "{{ctx.trigger.scheduled_time}}||-5s",
     "to": "{{ctx.trigger.scheduled_time}}"
 }

...can miss an event that happened within exactly this range but had not been indexed yet when the watch ran.

I've tried some workarounds but haven't found a proper solution yet.

What I have tried so far:

  1. Interval of 5s, range query from {{ctx.trigger.scheduled_time}}||-30s
    --> Problem: The condition is met up to 6 times for the same event (because 5s fits into 30s six times)
  2. Same as above, with a throttle_period of 30s
    --> Problem: After the condition is met once, the next action is only executed after 30s, which could be too long
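
The factor of six in the first attempt can be checked with a small simulation. This is a plain Python sketch, not Watcher code; the event time of 12s and the half-open window (run - 30s, run] are assumptions chosen for illustration.

```python
INTERVAL = 5    # watch schedule in seconds
LOOKBACK = 30   # range-query lookback in seconds


def matching_runs(event_time, horizon=120):
    """Return the scheduled times whose window (t - LOOKBACK, t] contains the event."""
    runs = range(0, horizon + 1, INTERVAL)
    return [t for t in runs if t - LOOKBACK < event_time <= t]


# Hypothetical event timestamp at 12s
print(matching_runs(12))  # [15, 20, 25, 30, 35, 40]
```

Each of those six runs would satisfy the condition for the same event, which is why an unthrottled action fires repeatedly.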

So the perfect solution would be a watch that checks every 5s whether something (let's say EventA) happened in the last 30s. If so, the action (email) should fire. When another event (EventB) happens, the action should fire again. But if no new event happens, the action should not fire again just because EventA still matches the condition and the range query.
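
Expressed as code, the desired behaviour looks roughly like the sketch below. It is plain Python rather than a watch definition. It assumes that each event carries a unique identifier (the `id` and `ts` fields are hypothetical) and that the set of already-alerted ids is stored somewhere between executions.

```python
LOOKBACK = 30  # seconds


def new_events(events, run_time, alerted_ids):
    """Return events inside the lookback window that have not triggered an alert yet."""
    fresh = [
        e for e in events
        if run_time - LOOKBACK < e["ts"] <= run_time and e["id"] not in alerted_ids
    ]
    # Remember what was alerted on so later overlapping runs stay silent
    alerted_ids.update(e["id"] for e in fresh)
    return fresh


# Hypothetical events: EventA at 3s, EventB at 14s
events = [{"id": "EventA", "ts": 3}, {"id": "EventB", "ts": 14}]
alerted = set()
for run_time in (5, 10, 15, 20):
    fired = [e["id"] for e in new_events(events, run_time, alerted)]
    print(run_time, fired)
```

EventA is reported by the first run that sees it and EventB by the first run after it occurs; the overlapping runs in between report nothing.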

Is there any way to achieve this?


(Alexander Reelsen) #2

Hey,

indeed, it is very hard, if not impossible, to catch every event exactly once. There are several reasons for this:

  • Delays in your ingestion pipeline: processing, transport, and indexing each take time, and those delays add up
  • Elasticsearch refresh intervals. Even when data is indexed in Elasticsearch, it might take some time before it is actually available for search (by default up to 1 second). This also implies that even if you add the index time to a document, you might miss it.

So possible solutions:

  • If you really need to check every event that comes in, you could take a look at the percolator and run it before or after the indexing operation.
  • Split your watch in two. The first watch is the frequent one: it checks for events but, instead of sending an email, indexes its matches into another index. The second watch queries that index, removes duplicate entries (you could, for example, aggregate on the hostname or alert name in that index), and then sends an email based on the deduplicated result.
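
The two-watch split can be sketched as follows. This is a plain Python simulation of the idea, not Watcher configuration: the list `alert_index` stands in for the intermediate index, and the field names `host` and `ts` are assumptions for illustration.

```python
from collections import defaultdict

LOOKBACK = 30  # seconds


def frequent_watch(events, run_time, alert_index):
    """First watch: copy every match into the intermediate index, duplicates included."""
    for e in events:
        if run_time - LOOKBACK < e["ts"] <= run_time:
            alert_index.append({"host": e["host"], "ts": e["ts"]})


def notifying_watch(alert_index):
    """Second watch: aggregate on hostname so each host yields one notification."""
    by_host = defaultdict(set)
    for doc in alert_index:
        by_host[doc["host"]].add(doc["ts"])
    return {host: sorted(ts) for host, ts in by_host.items()}


alert_index = []
events = [{"host": "web-1", "ts": 12}, {"host": "db-1", "ts": 18}]
for run_time in range(15, 50, 5):
    frequent_watch(events, run_time, alert_index)

print(len(alert_index))              # raw matches, with repeats
print(notifying_watch(alert_index))  # one entry per host
```

The first stage is allowed to be noisy; only the second stage decides what reaches the mailbox.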

Hope that helps.

--Alex


(Steve Kearns) #3

There might be another solution. It is a bit awkward, and won't work for all situations, but is worth mentioning.

The Chained Inputs feature of Watcher allows you to run more than one input. You could use the HTTP input to run a field stats request against the time-field of interest - this will show you the actual date range that currently exists in the searchable index, which would account for the time lag in getting the data processed, sent to ES, and the refresh period that @spinscale mentions.
Then your second input could be the search you're doing now, but instead of now-5s, you would do <ctx.field_stats_value_from_your_first_query>-5s. This would be considerably more robust.
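
The effect of anchoring the range on the newest searchable timestamp instead of the trigger time can be illustrated like this. It is a plain Python sketch: `max_indexed_ts` stands for whatever maximum value the first input returns for the time field, and the function name, the sample timestamps and the 5s lookback are assumptions.

```python
from datetime import datetime, timedelta


def search_window(max_indexed_ts, lookback=timedelta(seconds=5)):
    """Build the range for the second input from the newest searchable timestamp."""
    return {
        "from": (max_indexed_ts - lookback).isoformat(),
        "to": max_indexed_ts.isoformat(),
    }


# Hypothetical values: the watch fires at 12:00:30, but the pipeline lags,
# so the newest searchable document is from 12:00:22.
scheduled_time = datetime(2017, 1, 1, 12, 0, 30)
max_indexed_ts = datetime(2017, 1, 1, 12, 0, 22)

print("lag:", scheduled_time - max_indexed_ts)
print(search_window(max_indexed_ts))
```

Because the window follows the data rather than the clock, documents delayed by the pipeline are still covered, although documents that arrive out of order and are older than the lookback would still be missed.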

If you wanted to account for bursts or backups in indexing, you could also consider extending the "lookback" period, and using throttling to ensure that you don't send too many notifications.

