I have Watchers created for numerous ML jobs, which are monitoring Metricbeat data. For example, one of my ML jobs monitors system.cpu.iowait.pct for unusual increases in CPU time spent in iowait (all of my ML jobs have host.name as an influencer). I have my Watcher configured to send an email when ML determines iowait is above the high_mean, and this works as expected.
However, I now have a requirement to update my Watcher to send ML alerts (via a webhook) to our event management application when there is an anomaly AND when the anomaly has cleared. I am confident I can make the first part work; however, I'm not sure how to generate the normal (cleared) message.
My question is: what is the best practice for determining that an ML job (for a particular host.name) has returned to normal?
I have seen similar questions on this forum with suggestions to query .watcher-history-* and/or .ml-anomalies-*, but the problem I have found is that I'm unable to determine which host.name the Watcher triggered on and, more importantly, which host.names have returned to normal.
What would be nice is a way to create a Watcher that generates a clearing event (all is normal) after the associated ML job reports normal behavior. Is there a way to do something like this?
I would greatly appreciate if someone could point me in the right direction.
Hi,
I guess this is not a question about Watcher, but rather about how to write a query that also returns that information. For this concrete example you will need an aggregation that counts the occurrence of each xyz.hostname value in the documents that match; for that you need a terms aggregation. You can check out the terms aggregation documentation. Once you get this aggregation right, you can loop through the array of returned buckets and use each host name found in a bucket. You might want to check out the Mustache docs for this: https://mustache.github.io/mustache.5.html
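A minimal, untested sketch of what I mean (the metricbeat-* index pattern, the host.name field, the time range and the logging action are just placeholders for whatever your real watch uses):

```json
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "metricbeat-*" ],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
          "aggs": {
            "hosts": {
              "terms": { "field": "host.name", "size": 100 }
            }
          }
        }
      }
    }
  },
  "actions": {
    "log_hosts": {
      "logging": {
        "text": "Hosts seen: {{#ctx.payload.aggregations.hosts.buckets}}{{key}} ({{doc_count}} docs) {{/ctx.payload.aggregations.hosts.buckets}}"
      }
    }
  }
}
```

The terms aggregation gives you one bucket per host name, and the Mustache section loop in the action text iterates over ctx.payload.aggregations.hosts.buckets, so each bucket's key (the host name) and doc_count end up in the message.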
ML is overkill and you can accomplish everything with a pretty straightforward watch. ML comes into play when you're trying to detect things that aren't so easy to define with a rule/threshold.
I shall also tag a few other folks from the ML team here for more input: @dave.roberts
My reason for using ML is that all of my nodes (many hundreds) have a unique "pattern" of how, when and why they use excessive resources, most of which is normal. The typical example is high disk I/O during backups. My hope was that utilizing ML would reduce the high number of false-positive alerts we receive from monitoring tools that simply trigger on a static threshold.
You would know the host.name if your original watch was looking at the .ml-anomalies-* index with result_type:record, because the offending host name is most likely contained in partition_field_value (since the ML job is likely split by setting partition_field_name to host.name in the job config).
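For instance, a watch input along these lines would surface the offending hosts directly (a sketch only; the job_id, score threshold and time range are placeholders, and on newer stack versions the hits.total comparison may need to be hits.total.value):

```json
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ ".ml-anomalies-*" ],
        "body": {
          "size": 50,
          "query": {
            "bool": {
              "filter": [
                { "term": { "job_id": "my_iowait_job" } },
                { "term": { "result_type": "record" } },
                { "range": { "record_score": { "gte": 75 } } },
                { "range": { "timestamp": { "gte": "now-30m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "notify": {
      "logging": {
        "text": "Anomalous hosts: {{#ctx.payload.hits.hits}}{{_source.partition_field_value}} (record_score {{_source.record_score}}) {{/ctx.payload.hits.hits}}"
      }
    }
  }
}
```

Each record result document carries partition_field_name / partition_field_value, so the action template can pull the host name straight out of _source.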
So, as for the "clearing notification": Watcher doesn't have this kind of logic built-in. You would have to implement it yourself. There are a variety of ways to do that, but ultimately it involves saving the "state" of "active" alerts (perhaps, in addition to notifying, the first watch also creates this state index with the index action). Then you'd employ a second watch that monitors and manages that state index (along with querying to see whether the entities of interest are still anomalous according to the ML job).
In other words, Watch #1 notifies and records the state that the alert is "open", and Watch #2 checks on "open" alerts and sees if they can be closed (also with a notification).
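To make that concrete, here is a rough, untested sketch of the pattern, assuming a hypothetical state index called ml-alert-state and the record query from the previous snippet (index names, thresholds and time ranges are illustrative only). Watch #1 would add an index action that records the first offending host as an "open" alert:

```json
"record_open_alert": {
  "transform": {
    "script": "return ['host': ctx.payload.hits.hits[0]._source.partition_field_value, 'status': 'open', 'opened_at': ctx.execution_time]"
  },
  "index": { "index": "ml-alert-state" }
}
```

Watch #2 would use a chain input to read the open alerts and the current anomalies side by side:

```json
{
  "input": {
    "chain": {
      "inputs": [
        { "open_alerts": {
            "search": {
              "request": {
                "indices": [ "ml-alert-state" ],
                "body": { "query": { "term": { "status": "open" } } }
              }
            }
        } },
        { "recent_anomalies": {
            "search": {
              "request": {
                "indices": [ ".ml-anomalies-*" ],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "result_type": "record" } },
                        { "range": { "record_score": { "gte": 75 } } },
                        { "range": { "timestamp": { "gte": "now-2h" } } }
                      ]
                    }
                  },
                  "aggs": {
                    "still_anomalous": {
                      "terms": { "field": "partition_field_value", "size": 1000 }
                    }
                  }
                }
              }
            }
        } }
      ]
    }
  }
}
```

A Painless script (in the condition and/or a transform) would then diff the two: any host that is "open" in ml-alert-state but missing from the still_anomalous buckets gets the "cleared" webhook, and its state document is flipped to "closed" (for example with a second index action). If several hosts can be anomalous at once, the Watch #1 transform would instead return a payload with a _doc array so the index action writes one state document per host.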