ML watcher configs

We have a watcher running for a daily ML job (1d bucket span) over .ml-anomalies* that fires for record results where is_interim is false, plus some other conditions on record_score and actual values. The watcher has a schedule interval of 30h and filters results like this:
"filter": {
"range": {
"timestamp": {
"gte": "now-31h/h",
"lt": "now-1h/h"
} } }
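
The 30h schedule interval corresponds to a trigger block roughly like the fragment below (only the relevant part of the watch is shown; the rest of the watch body is omitted here):

    "trigger": {
      "schedule": {
        "interval": "30h"
      }
    }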

The watcher fires on some executions; however, we still see executions that do not fire even though there are matching results in .ml-anomalies. If we simulate the watcher execution and change the timestamp range to cover those results, the watcher fires and shows the results, so the query and the overall script condition look fine for those runs. I have seen articles recommending an interval of twice the bucket span. However, I'm still wondering whether any other part of the configuration (ML job or watcher) could cause these missing results. What are the best approaches for ML watcher configuration so we can be sure we are not missing results?

thanks,
Sara

In general, keep the interval over which your watch looks for recently created anomalies no shorter than the equivalent of twice the bucket_span. If you try to customize this and make the interval too short, your watch may inadvertently miss newly created anomalies. This is because the timestamp of an anomaly written to the index is the beginning time of the bucket. So, an ML job currently processing the bucket of data between 11:55 and 12:00 (a 5-minute bucket_span job, in this example) will index any anomalies found in that timeframe with a timestamp of 11:55. The clock time at which this anomaly record is indexed could be as late as around 12:01, due to the default 60s query_delay parameter of the datafeed and any other time spent processing that data. As such, a watch triggered at 12:00 (clock time) which looks back for anomalies with timestamps as early as 11:55 will not see anything, because the anomaly record hasn't even been indexed yet (and won't be for another 60 seconds). This is why keeping the interval to a width equivalent to twice the bucket_span will ensure that anomalies won't be missed by your watch.
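
To make that concrete for the 5-minute bucket_span example above: a watch that triggers every 5 minutes could look back over a 10-minute window (two bucket_spans). This is only an illustrative range filter, not a full watch definition:

    "filter": {
      "range": {
        "timestamp": {
          "gte": "now-10m",
          "lt": "now"
        }
      }
    }

With this window, the 11:55 bucket that only gets indexed around 12:01 is missed by the 12:00 run but still matched by the 12:05 run. Note that with overlapping windows the same anomaly can match on two consecutive runs, so some de-duplication or throttling of the alert action may be needed.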


Hi @richcollier, thank you for your reply. I have increased the interval and that seems to have fixed the missed results. However, I found another inconsistency in the results, and I'm not sure whether it is related to the interval or not. I had another watcher looking at the results of an ML job with a 10min bucket span and a query_delay of around 15min. The watcher has a timestamp filter with the following config, to be sure all the results are written and final. I noticed that one execution of the watcher fired for a record with a record_score of 85 which I couldn't find in the .ml-anomalies results. However, I did find a record with the same values (timestamp, typical, actual, ...) but with record_score=24 and initial_record_score=85. So it looks like the score was updated after the bucket results were finalized (is_interim:false). Is that expected behavior? I was expecting the score normalization to happen before is_interim:false results are written.

           "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-3h/h",
                    "lt": "now-1h/h"
                  }}},
              "must": [   {
                  "terms": {
                    "result_type": [
                      "record"
                    ]  } },
                {
                  "range": {
                    "record_score": {
                      "gt": 75}}},
                {
                  "range": {
                    "multi_bucket_impact": {
                      "lt": 2}}},
                {
                  "match": {
                    "is_interim": false}}
              ]

Scores can be renormalized after the record is initially written. So yes, you could get alerted on an anomaly with a score of 85, because that was the score when it happened. But, perhaps after some time has passed (and other, more egregious anomalies have occurred), the first anomaly can get downgraded to something lower so that relatively speaking, its score is in line with all anomalies that have ever occurred for that job.
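
If you would rather alert on the score as it was at detection time (so the stored result always matches what triggered the alert), one option is to filter on the initial_record_score field you found instead of record_score, since that field is not changed by renormalization. A sketch of how the relevant clause could look, as an illustration only:

    {
      "range": {
        "initial_record_score": { "gt": 75 }
      }
    }

Whether that is preferable depends on whether you care about the score at the moment the anomaly was detected, or the renormalized score relative to the job's full history.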

