I have an ML job running and, for the case below, it registered anomalies but then stopped reading the data and thus did not send me an alarm (via a watcher that uses averages over a certain time window), despite all the configured conditions being satisfied. A screenshot is attached. It seems that for the circled area the job did not register zero detections (as if it had stopped), and the averages in the return function below were not calculated. Can you let me know why this may happen?
My watcher filters the alarm calls using:
return (t.total_anomaly_score.value > ctx.metadata.min_anomaly_score && t.average_deviation.value < 0 && t.doc_count > 2 && t.average_actual_value.value == 0 && !t.key.contains('PD') && !t.key.contains('newos') && !t.key.contains('S') && !t.key.contains('uninstalled'))
There really isn't enough information here to diagnose your situation. What is your detector configuration?
It looks as if the ML job is no longer getting data. Is there a problem with your ingest pipeline? Did the data just cease to exist?
By the way, if you're using a metric-based detector function (e.g. min, etc., as opposed to a count-based function) and there is no input data, the detector function treats the lack of data as a null and doesn't evaluate anomalies during those periods of no data.
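To make the distinction concrete, here is a toy Python model of what each kind of function sees per bucket. It is only an illustration of the behaviour described above, not how the ML engine is implemented; the helper name `bucket_input` and the sample data are made up.

```python
def bucket_input(values, function):
    """Toy model: the value a detector would be given for one bucket."""
    if function == "count":
        # A count-based function still has a value for an empty bucket: zero.
        return len(values)
    if function == "min":
        # A metric-based function has nothing to evaluate: treat it as null.
        return min(values) if values else None
    raise ValueError(f"unsupported function: {function}")


# Hypothetical data: three consecutive buckets, the middle one has no documents.
buckets = [[3.0, 1.5], [], [2.0]]

count_series = [bucket_input(b, "count") for b in buckets]
min_series = [bucket_input(b, "min") for b in buckets]
```

In this sketch the empty bucket shows up as 0 in `count_series` but as `None` in `min_series`, which is why a gap in the data can look anomalous to one kind of detector and be skipped by the other.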
Do any of the recent Annotations (9 through 12) mention anything interesting?
Hi, thanks for your reply.
The detector configuration is count. This only happens rarely, but I am trying to understand the reason behind it to make our ML job more reliable. Otherwise there is no problem with the data, because detection for all other devices is fine.
One of the annotations is "Datafeed has missed 686 documents due to ingest latency, latest bucket with missing data". The others are just snapshot storage ones.
OK, well, certainly if you routinely see the "missed documents due to ingest latency" message, you should either remedy your ingest pipeline latencies or increase the query_delay on the ML job's datafeed so that it lags behind real time a little more to accommodate those latencies.
Regardless, if you have ML raising anomalies but your watch isn't alerting, then usually this is a sign that your watch isn't looking back in time far enough. In general, keep the interval over which your watch looks for recently created anomalies no shorter than the equivalent of twice the bucket_span. If you make the interval too short, your watch may inadvertently miss newly created anomalies. This is because the timestamp of an anomaly written to the index is equal to the beginning time of the bucket. So an ML job currently processing the bucket of data between 11:55 and 12:00 will index any anomalies found in that timeframe with a timestamp of 11:55 (obviously this is a 5-minute bucket_span job). The clock time at which this anomaly record is indexed will be even later, due to the query_delay parameter of the datafeed and any other time associated with processing that data. As such, a watch triggered at 12:00 (clock time) which looks back for anomalies with timestamps as early as 11:55 will not see anything, because the anomaly record hasn’t even been indexed yet (and won’t be for another amount of seconds equal to query_delay). This is why keeping the interval to a width equivalent to twice the bucket_span will ensure that anomalies won’t be missed by your watch.
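The timing argument above can be checked with a small Python sketch. It is a simplified model under stated assumptions: a 5-minute bucket_span, a hypothetical query_delay of 60 seconds, a watch that fires every 5 minutes, and no processing time beyond query_delay. The function names are made up for illustration and are not part of any Elastic API.

```python
from datetime import datetime, timedelta


def anomaly_visible(watch_time, bucket_start, bucket_span, query_delay, lookback):
    """Would a watch firing at watch_time find the anomaly for this bucket?"""
    # The record carries the timestamp of the START of the bucket.
    record_timestamp = bucket_start
    # It can only be indexed once the bucket has closed and query_delay has passed.
    indexed_at = bucket_start + bucket_span + query_delay
    in_window = watch_time - lookback <= record_timestamp <= watch_time
    return indexed_at <= watch_time and in_window


def caught(lookback, watch_times, **kwargs):
    """True if at least one watch execution sees the anomaly."""
    return any(anomaly_visible(t, lookback=lookback, **kwargs) for t in watch_times)


bucket_span = timedelta(minutes=5)
params = dict(
    bucket_start=datetime(2024, 1, 1, 11, 55),  # bucket covers 11:55 to 12:00
    bucket_span=bucket_span,
    query_delay=timedelta(seconds=60),          # assumed value
)
watch_times = [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 5)]

missed_with_one_span = not caught(bucket_span, watch_times, **params)
caught_with_two_spans = caught(2 * bucket_span, watch_times, **params)
```

With a lookback of one bucket_span, the 12:00 execution runs before the record is indexed and the 12:05 execution only looks back to 12:00, so the 11:55 record is never seen. With a lookback of two bucket_spans, the 12:05 execution reaches back to 11:55 and picks it up.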
Thanks a lot for the thorough explanation. It is really helpful. I will increase the query_delay to see if it helps.