I have Watchers created for numerous ML jobs that monitor Metricbeat data. For example, one of my ML jobs monitors system.cpu.iowait.pct for unusual increases in CPU time spent in iowait (all of my ML jobs have host.name as an influencer). My Watcher is configured to send an email when ML determines iowait is above the high_mean, and this works as expected.
However, I now have a requirement to update my Watcher to send ML alerts (via a webhook) to our event management application both when there is an anomaly AND when the anomaly has cleared. I am confident I can make the first part work; however, I’m not sure how to generate the normal (cleared) message.
My question is: what is the best practice for determining that an ML job (for a particular host.name) has returned to normal?
I have seen similar questions on this forum with suggestions to query .watcher-history-* and/or .ml-anomalies-*, but the problem I have found is that I’m unable to determine which host.name the Watcher triggered on and, more importantly, which host.name values have returned to normal.
Ideally, I would like to create a Watcher that generates a clearing event (all is normal) after the associated ML job reports normal behavior for a host. Is there a way to do something like this?
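To make the goal concrete, here is a minimal Python sketch of the behavior I'm after, not a working Watcher. It assumes the influencer records in .ml-anomalies-* carry the fields job_id, result_type, influencer_field_name, influencer_field_value, influencer_score and timestamp; the function names, the job ID, the score threshold and the time window are all hypothetical. The idea is to list the hosts that are currently anomalous and compare them with the hosts seen on the previous run, so that any host that dropped out gets a "clear" event.

```python
from typing import Dict, Iterable, List, Set


def build_influencer_query(job_id: str, min_score: float,
                           window: str = "now-30m") -> dict:
    """Hypothetical search body for .ml-anomalies-*: one bucket per
    host.name that is an influencer above min_score within the window."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"job_id": job_id}},
            {"term": {"result_type": "influencer"}},
            {"term": {"influencer_field_name": "host.name"}},
            {"range": {"influencer_score": {"gte": min_score}}},
            {"range": {"timestamp": {"gte": window}}},
        ]}},
        # Each bucket key is a host.name value that is currently anomalous.
        "aggs": {"hosts": {"terms": {"field": "influencer_field_value",
                                     "size": 1000}}},
    }


def diff_host_states(previous: Set[str],
                     current: Iterable[str]) -> Dict[str, List[str]]:
    """Compare hosts that were anomalous on the last run with hosts that
    are anomalous now, and decide which events to send."""
    current_set = set(current)
    return {
        # Newly anomalous hosts: send the alert webhook.
        "raise": sorted(current_set - previous),
        # Hosts that were anomalous and no longer are: send the clear webhook.
        "clear": sorted(previous - current_set),
    }


if __name__ == "__main__":
    # Assumed example data: hosts flagged on the previous and current runs.
    events = diff_host_states({"web-01", "db-01"}, ["db-01", "app-02"])
    print(events)
```

The part I don't know how to do in Watcher is the equivalent of the `previous` set, i.e. remembering per host.name what was alerted on last time.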
I would greatly appreciate it if someone could point me in the right direction.