I've set up an ML job that tracks all of the servers we have deployed Metricbeat on, using the `low_count` detector split by `beat.name` so that each server is tracked individually. I've also set up a Watcher that matches the following condition: when count(), over all documents, is below 1 for the last 1 minute, an email/log is generated. The idea behind this watch is to let us know that a Metricbeat has stopped reporting completely and that there may be an issue with a server, since it's not sending any documents. This works great with the 1-minute interval, but when we set "for the last" to 5 minutes, it never triggers when we turn off Metricbeat on a specific server. Is this because it is seeing Metricbeat data coming in from other servers over the 5-minute span, so they are in essence masking the fact that MB is not reporting on one server? Also, our aggregation interval and bucket span are set to 1 minute, if that makes a difference. Appreciate the insight.
How about changing your watch a little bit (not sure if the Watcher UI is capable of that): run a terms aggregation on the hostnames for the last minute, and another terms aggregation on the hostnames for the last 5 minutes, then compare the two in the condition and check whether any hosts are missing?
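A rough sketch of what that watch body could look like, assuming a `metricbeat-*` index pattern and a keyword `beat.name` field (adjust both to your setup); the condition script loops over the two bucket lists and fires if any host seen in the 5-minute window is absent from the last minute:

```json
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "metricbeat-*" ],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
          "aggs": {
            "hosts_5m": {
              "terms": { "field": "beat.name", "size": 1000 }
            },
            "last_minute": {
              "filter": { "range": { "@timestamp": { "gte": "now-1m" } } },
              "aggs": {
                "hosts_1m": { "terms": { "field": "beat.name", "size": 1000 } }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "def seen = []; for (b in ctx.payload.aggregations.last_minute.hosts_1m.buckets) { seen.add(b.key) } for (b in ctx.payload.aggregations.hosts_5m.buckets) { if (!seen.contains(b.key)) { return true } } return false;"
    }
  }
}
```

You'd still add your email/log action; the action's transform could build the list of missing hosts the same way the condition does.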
I'd be curious as to how you set up the watch itself (the actual JSON), but one thing to keep in mind is that ML jobs write results with a timestamp equal to the leading edge of the `bucket_span`. So if your `bucket_span` is 5m, and an anomaly occurs in the interval 11:00-11:05, the anomaly record is written with timestamp=11:00. And, since there is also a slight delay in the ML job due to the `query_delay` parameter, this results document with timestamp=11:00 is really written at a clock time of perhaps a little past 11:06. So, if your watch isn't accounting for this, you will "miss" seeing the anomaly document.
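To account for that offset, the watch input querying the ML results index should look back at least two bucket spans. A minimal sketch, assuming a 5m `bucket_span` and the standard `.ml-anomalies-*` results indices (the `result_type`, `timestamp`, and `anomaly_score` fields are part of the ML results schema; the 75-point threshold is just an example):

```json
"input": {
  "search": {
    "request": {
      "indices": [ ".ml-anomalies-*" ],
      "body": {
        "query": {
          "bool": {
            "filter": [
              { "term": { "result_type": "bucket" } },
              { "range": { "timestamp": { "gte": "now-10m" } } },
              { "range": { "anomaly_score": { "gte": 75 } } }
            ]
          }
        }
      }
    }
  }
}
```

The `now-10m` lookback (2x the bucket span) gives the job time to finalize and write the bucket before the watch queries for it.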
So I've been able to get some watches to fire at a more consistent rate, so that's a step in the right direction. What I'm learning, however, is that if you use one or just a few boxes to turn Metricbeat on and off, the ML job picks up on that and doesn't produce critical errors that make it easy to sort the servers in the alerting dashboard. This seems a little odd with the ML metrics, since the server is not reporting at all, which should be a critical error, yet ML only produces the lowest severity level, warning. I do get how testing this multiple times affects ML, but it still seems odd to me that critical anomalies are not produced when Metricbeat stops reporting. I'll try to post what I have once I fine-tune it all.
Stumbled across a really good reference to work off of. Hope this helps anyone that's looking to do something similar. https://github.com/elastic/examples/blob/master/Alerting/Sample%20Watches/system_fails_to_provide_data/watch.json
Yes, the more you test, the more you reinforce to ML that the behavior of being "off" is normal and it will indeed be naturally scored less - that's not surprising.
And yes, if you're just looking for a binary on/off condition, then ML is overkill and you can accomplish everything with a pretty straightforward watch. ML comes into play when you're trying to detect things that aren't so easy to define with a rule/threshold, like the anomaly near the right-hand side here: