Generic watch, specific alerts, duplicate alerting?

Imagine that I have 100 machines to monitor and that each machine is named by its number (i.e. the first machine is named "1").
Imagine that I wish to alert if the disk usage is over 90% on any of the machines.
Imagine that Logstash is inserting data into ES every minute with the current disk usage percentage for each machine.
I would like to send an alert to pagerduty for every machine that is over 90% disk usage.
I know how to craft my search query to find machines with disk usage over 90%.
Suppose machine "5" goes over the limit of 90%.
I can generate a pagerduty alert via the watcher action that sends information about which machine is over the limit by indexing into the payload like this: {{ctx.payload.hits.hits.0._source}}.
Now suppose machine "7" goes over the limit. At this point I have 2 machines that are over the limit.
I think what I really want here is to alert on both machines.
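
To make this concrete, here is a simplified sketch of the kind of watch I mean (the index pattern, field names, and the pagerduty endpoint details below are just placeholders, not my real config):

```json
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "metrics-*" ],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-1m" } } },
                { "range": { "disk_used_pct": { "gte": 90 } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "notify_pagerduty": {
      "webhook": {
        "scheme": "https",
        "host": "events.pagerduty.com",
        "port": 443,
        "method": "post",
        "path": "/generic/2010-04-15/create_event.json",
        "body": "{ \"service_key\": \"MY_SERVICE_KEY\", \"event_type\": \"trigger\", \"description\": \"Disk over 90% on machine {{ctx.payload.hits.hits.0._source.machine}}\" }"
      }
    }
  }
}
```

As written, though, the action only ever reports the first hit.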

Question 1: How do I iterate over the over-the-limit machines to insert alerts into pagerduty? Do I have to write some Java, or can this be done completely using the watcher config language?

Question 2: My initial thought was that I'd search for all machines over the limit, but I want to avoid sending an alert to pagerduty for machines that I've already alerted on. How do I avoid sending duplicate alerts? Note that I don't see how throttling can help me here, since I don't want to stop alerting for other machines that newly go over the limit. Do I have to play some complicated game whereby I update the ES entry with a timestamp of the last alert sent and then clear it out when the condition becomes false? I certainly don't want to write a new watch every time I add or remove a machine from the monitoring pool.

I think I can use a script to iterate over the items in a transform (and, for example, collect them into a list), but can I use a script to generate multiple actions/webhooks?
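
The collecting part looks doable with something like this (just a sketch with made-up field names, using a Groovy-style inline script):

```json
"transform": {
  "script": {
    "inline": "return [ machines: ctx.payload.hits.hits.collect { it._source.machine } ]"
  }
}
```

But that still leaves me with a single payload and, as far as I can tell, a single action execution.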

@spinscale
I value your input and watcher expertise.
I think I have outlined a common use case that you have likely already thought about.
I would love to get your feedback!

Since nobody has suggested a way to iterate over a collection assembled by a script taking input from a search input ...

I guess the easiest approach is to collect/build the JSON list from the aggregation in an inline script,
pass the JSON list to a simple (probably Python) server,
have the Python server iterate over the list, and
pass each element of the iteration on to pagerduty.
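
On the Watcher side that would presumably be a webhook action posting the whole payload to that little server, something like this (host, port, and path are made up, and it assumes the toJson mustache extension is available):

```json
"actions": {
  "post_to_relay": {
    "webhook": {
      "scheme": "http",
      "host": "alert-relay.internal",
      "port": 5000,
      "method": "post",
      "path": "/watcher-alerts",
      "headers": { "Content-Type": "application/json" },
      "body": "{{#toJson}}ctx.payload{{/toJson}}"
    }
  }
}
```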

Hey there,

Indeed you have a common use case. The index action allows one to construct an array in the payload and index a document for each element of that array. However, other actions don't support this yet. We are currently thinking about the best way to support it: either support it per action, or have some config option that points to an array in the resulting payload and then applies all the actions for each element found.
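
A rough sketch of the index action case (index, type, and field names are placeholders): if a transform puts the array under a `_doc` key in the payload, the index action indexes one document per array element:

```json
"transform": {
  "script": {
    "inline": "return [ _doc: ctx.payload.hits.hits.collect { [ machine: it._source.machine, disk_used_pct: it._source.disk_used_pct ] } ]"
  }
},
"actions": {
  "index_offenders": {
    "index": {
      "index": "disk-alerts",
      "doc_type": "alert"
    }
  }
}
```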

In the meantime you could use the webhook action to write to Logstash and there use the split filter and its json codec to execute an action for each element in the hits array. Logstash also has a pagerduty output.
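
The Logstash side could look roughly like this (untested sketch; port, service key, and field names are placeholders):

```
input {
  http {
    port => 8080
    # JSON request bodies are decoded via the json codec
    additional_codecs => { "application/json" => "json" }
  }
}

filter {
  # turn the single watch payload into one event per element of hits.hits
  split {
    field => "[hits][hits]"
  }
}

output {
  pagerduty {
    service_key => "YOUR_PAGERDUTY_SERVICE_KEY"
    event_type  => "trigger"
    description => "Disk over 90% on machine %{[hits][hits][_source][machine]}"
  }
}
```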

--Alex

Yes, I was thinking of the LS "split" and "lines" filters, but I wasn't clever enough to think of sending Watcher output to LS. That's a nice workaround.

Of course, as I'm sure you would agree, it would be nice to have the functionality of those 2 LS filters in Watcher (i.e. converting one event to many). That way it would be possible to keep it all in Watcher and not have to deploy yet another piece of s/w (LS, a Python Tornado server, etc.).

I will likely target the pagerduty v2 API, which means that I'll have to use a webhook output from your proposed LS solution instead of the existing LS pagerduty output plugin.
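
That is, on the LS side probably the generic http output posting to the v2 events endpoint, something like this (untested, off the top of my head; the routing key and field paths are placeholders):

```
output {
  http {
    url          => "https://events.pagerduty.com/v2/enqueue"
    http_method  => "post"
    format       => "message"
    content_type => "application/json"
    message      => '{ "routing_key": "MY_ROUTING_KEY", "event_action": "trigger", "payload": { "summary": "Disk over 90% on machine %{[hits][hits][_source][machine]}", "source": "%{[hits][hits][_source][machine]}", "severity": "warning" } }'
  }
}
```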

So, thank you for the excellent suggestion. Any timeframe you'd like to share about native Watcher support for the split idea?


Hey,

I totally agree with you that we need this kind of feature in Watcher. As we are currently working on other features, I have no idea when this one is going to get worked on.

Happy to ping here once I know more.

--Alex