Machine learning typical value not set at the moment of firing watcher Part 2

Hi,
this thread is a continuation of a previous thread that was closed due to inactivity; you can find it here, and I encourage you to read it first.
Back then I couldn't provide more info (watcher execution data, etc.), but yesterday new anomalies occurred, so here I am with more details.

So once again I've received an email from my watcher with the typical value reported as 0:

Here is the watcher execution result:

"execution_time": "2020-08-30T13:02:00.115Z",
"execution_duration": 1001,
"input": {
  "type": "search",
  "status": "success",
  "payload": {
    "_shards": {
      "total": 1,
      "failed": 0,
      "successful": 1,
      "skipped": 0
    },
    "hits": {
      "hits": [
        {
          "_index": ".ml-anomalies-c",
          "_type": "_doc",
          "_source": {
            "result_type": "record",
            "actual": [
              0
            ],
            "probability": 6.599327790345429e-14,
            "multi_bucket_impact": 5,
            "is_interim": true,
            "detector_index": 0,
            "initial_record_score": 74.0999656090286,
            "bucket_span": 900,
            "job_id": "my_tracker_casa",
            "function": "count",
            "typical": [
              5.820766091346741e-11
            ],
            "record_score": 74.0999656090286,
            "timestamp": 1598792400000,
            "function_description": "count"
          },
          "_id": "my_tracker_casa_record_1598792400000_900_0_0_0",
          "sort": [
            74.0999656090286
          ],
          "_score": null
        }
      ],
      "total": 1,
      "max_score": null
    },
    "took": 3,
    "transform": [
      {
        "recordsHtml": "<tr><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 2020-08-30 16:00:00 - 2020-08-30 16:15:00 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 0 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 0 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;color:#ff0000;background-color:#fff;text-align:center;vertical-align:middle'> Unexpected zero value for count! </td></tr>"
      }
    ],
    "timed_out": false
  },
  "search": {
    "request": {
      "search_type": "query_then_fetch",
      "indices": [
        ".ml-anomalies-*"
      ],
      "rest_total_hits_as_int": true,
      "body": {
        "query": {
          "bool": {
            "filter": [
              {
                "term": {
                  "result_type": "record"
                }
              },
              {
                "term": {
                  "job_id": "my_tracker_casa"
                }
              },
              {
                "range": {
                  "timestamp": {
                    "gte": "now-15m-2m",
                    "lte": "now-2m"
                  }
                }
              },
              {
                "range": {
                  "record_score": {
                    "gte": "50"
                  }
                }
              }
            ]
          }
        },
        "sort": [
          {
            "record_score": {
              "order": "desc"
            }
          }
        ]
      }
    }
  }
},
"condition": {
  "type": "compare",
  "status": "success",
  "met": true,
  "compare": {
    "resolved_values": {
      "ctx.payload.hits.total": 1
    }
  }
},
"transform": {
  "type": "script",
  "status": "success",
  "payload": {
    "anomaly_details": [
      {
        "recordsHtml": "<tr><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 2020-08-30 16:00:00 - 2020-08-30 16:15:00 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 0 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;background-color:#fff;text-align:center;vertical-align:middle'> 0 </td><td style='padding:4px 8px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#ccc;color:#ff0000;background-color:#fff;text-align:center;vertical-align:middle'> Unexpected zero value for count! </td></tr>"
      }
    ],
    "recordsCnt": 1
  }
} 

Here you can clearly see that the typical value is set to an unreasonably low value of 5.820766091346741e-11 (that is, 0.00000000005820766091346741), hence the 0 in the report!
Next, when I try to search for the document with "_id": "my_tracker_casa_record_1598792400000_900_0_0_0" (my hit), I get a 'not found' response:

GET .ml-anomalies-shared/_doc/my_tracker_casa_record_1598792400000_900_0_0_0
{
  "_index" : ".ml-anomalies-shared",
  "_type" : "_doc",
  "_id" : "my_tracker_casa_record_1598792400000_900_0_0_0",
  "found" : false
} 

Furthermore, yesterday we had more than one anomaly, but we received only one mail, for one time span.

I ran the search for yesterday's anomalies and got back four hits with valid (as expected) data:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "_doc",
        "_id" : "my_tracker_casa_record_1598791500000_900_0_0_0",
        "_score" : null,
        "_source" : {
          "job_id" : "my_tracker_casa",
          "result_type" : "record",
          "probability" : 5.81779203366539E-11,
          "multi_bucket_impact" : 5.0,
          "record_score" : 57.347498714368754,
          "initial_record_score" : 57.347498714368754,
          "bucket_span" : 900,
          "detector_index" : 0,
          "is_interim" : false,
          "timestamp" : 1598791500000,
          "function" : "count",
          "function_description" : "count",
          "typical" : [
            71179.63058880977
          ],
          "actual" : [
            17566.0
          ]
        },
        "sort" : [
          1598791500000
        ]
      },
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "_doc",
        "_id" : "my_tracker_casa_record_1598784300000_900_0_0_0",
        "_score" : null,
        "_source" : {
          "job_id" : "my_tracker_casa",
          "result_type" : "record",
          "probability" : 1.945876626800812E-9,
          "multi_bucket_impact" : -5.0,
          "record_score" : 50.380002714959474,
          "initial_record_score" : 50.380002714959474,
          "bucket_span" : 900,
          "detector_index" : 0,
          "is_interim" : false,
          "timestamp" : 1598784300000,
          "function" : "count",
          "function_description" : "count",
          "typical" : [
            72401.65010601748
          ],
          "actual" : [
            5034.0
          ]
        },
        "sort" : [
          1598784300000
        ]
      },
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "_doc",
        "_id" : "my_tracker_casa_record_1598783400000_900_0_0_0",
        "_score" : null,
        "_source" : {
          "job_id" : "my_tracker_casa",
          "result_type" : "record",
          "probability" : 3.0230566050167503E-10,
          "multi_bucket_impact" : -5.0,
          "record_score" : 54.92473725923562,
          "initial_record_score" : 54.92473725923562,
          "bucket_span" : 900,
          "detector_index" : 0,
          "is_interim" : false,
          "timestamp" : 1598783400000,
          "function" : "count",
          "function_description" : "count",
          "typical" : [
            71962.48775115353
          ],
          "actual" : [
            2821.0
          ]
        },
        "sort" : [
          1598783400000
        ]
      },
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "_doc",
        "_id" : "my_tracker_casa_record_1598782500000_900_0_0_0",
        "_score" : null,
        "_source" : {
          "job_id" : "my_tracker_casa",
          "result_type" : "record",
          "probability" : 1.4999206758024146E-10,
          "multi_bucket_impact" : -5.0,
          "record_score" : 56.31752327302499,
          "initial_record_score" : 56.31752327302499,
          "bucket_span" : 900,
          "detector_index" : 0,
          "is_interim" : false,
          "timestamp" : 1598782500000,
          "function" : "count",
          "function_description" : "count",
          "typical" : [
            71373.65288633927
          ],
          "actual" : [
            2669.0
          ]
        },
        "sort" : [
          1598782500000
        ]
      }
    ]
  }
}

This is very frustrating; as you can see, the search didn't even return the hit document that my watcher reported.

This sounds eerily similar to an older bug (ML alerts triggering on interim results) that existed about 2 years ago and was fixed.

That bug centered around an incorrect calculation of an anomaly expressed as an interim result (is_interim: true). An interim result is a result that exists while the current time bucket is not yet complete; the anomaly score can still change by the time the bucket is closed (at which point the record has is_interim: false). The fact that the document that was an anomaly is now gone would be consistent with a situation where the interim result was deemed anomalous, but when the bucket was closed it was no longer considered anomalous (and thus the record disappeared).

I'm not saying that there couldn't be another bug in this area - there very well may be - because we shouldn't calculate an anomalous interim result for a count of 0.

Until we figure out whether there's a bug here, perhaps you can avoid this situation by adding the following clause to your Watcher query's filter:

                  { "term" : { "is_interim" : false } }

This way, you avoid looking at interim results.
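For context, the full filter array would then look something like this (a sketch based on the query shown in your watcher output above; only the is_interim clause is new):

```json
"filter": [
  { "term":  { "result_type": "record" } },
  { "term":  { "job_id": "my_tracker_casa" } },
  { "term":  { "is_interim": false } },
  { "range": { "timestamp": { "gte": "now-15m-2m", "lte": "now-2m" } } },
  { "range": { "record_score": { "gte": "50" } } }
]
```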

Hi, and thanks for your help. I've added the suggested clause to my watcher. Now all I can do is wait for the next significant anomaly to occur and see whether this fixes the issue I'm having :sweat_smile:
Do you think this has any bearing on why the watcher didn't report the anomalies that happened around the same time (the four I get when I run the search)?

If your Watch is "missing anomalies", then it's possible the Watch is firing too "early". In other words, WHEN your Watch actually runs the query against the .ml-anomalies-* indices matters.

This is because anomaly records are written with a timestamp that is equal to the beginning of the bucket_span.

An ML job currently processing the bucket of data between 11:45 and 12:00 will index any anomalies found in that timeframe with a timestamp of 11:45 (this being a 15-minute bucket_span job). However, the clock time at which the anomaly record actually gets indexed could be 12:02 or even later, due to the datafeed's query_delay parameter plus any other time spent processing that data on the ML node. As such, a watch that triggers at 12:00 (clock time) and looks back for anomalies with timestamps as early as 11:45 will not see anything, because the anomaly record hasn't even been indexed yet (and won't be for possibly another 2+ minutes).

I see that you are almost doing this correctly here, but you may be right on the ragged edge of missing the anomalies.

To be safer, I would change your range to:

                "range": {
                  "timestamp": {
                    "gte": "now-30m"
                  }
                }

This will ensure that your watch doesn't miss anomalies.

Secondly, in your screenshot I see a system annotation for the datafeed missing data (which was later backfilled). This is a separate but equally important issue to remedy, usually fixed by increasing the query_delay value or by figuring out why your ingest pipeline is running slowly.
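If you do decide to raise query_delay, it can be changed via the update-datafeed API (you may need to stop the datafeed first). A sketch - the datafeed ID `datafeed-my_tracker_casa` is an assumption based on your job name, and 120s is just an illustrative value:

```json
POST _ml/datafeeds/datafeed-my_tracker_casa/_update
{
  "query_delay": "120s"
}
```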

Just noticed that in your watcher output, the name of the index your anomaly was written to is .ml-anomalies-c.

However, when you looked later, you queried .ml-anomalies-shared. This would explain why you couldn't find it.

Can you find it now that you know that? What does the updated (finalized) anomaly record say?

Also, can you verify that this is a single-metric job? (i.e. no splits)
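One way to locate the record without knowing the exact concrete index is to search by document ID across the whole .ml-anomalies-* pattern, using the standard ids query (a sketch, using the ID from your watcher output):

```json
GET .ml-anomalies-*/_search
{
  "query": {
    "ids": {
      "values": [ "my_tracker_casa_record_1598792400000_900_0_0_0" ]
    }
  }
}
```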

Hi,
I'll change the timestamp range part, introduce the query_delay value, and see if that helps.
Regarding the index name, I tried to search for my document in '.ml-anomalies-c', but I got back an error saying no such index exists:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_not_found_exception",
        "reason" : "no such index [.ml-anomalies-c]",
        "resource.type" : "index_expression",
        "resource.id" : ".ml-anomalies-c",
        "index_uuid" : "_na_",
        "index" : ".ml-anomalies-c"
      }
    ],
    "type" : "index_not_found_exception",
    "reason" : "no such index [.ml-anomalies-c]",
    "resource.type" : "index_expression",
    "resource.id" : ".ml-anomalies-c",
    "index_uuid" : "_na_",
    "index" : ".ml-anomalies-c"
  },
  "status" : 404
}

Maybe I made a mistake when copy/pasting the output (accidentally pressing the 'c' key) and the index name somehow changed to '.ml-anomalies-c'? (I can't find any index with that name.)
Regarding the missing datafeed data: at the time the anomalies occurred there were network problems, which caused all the trouble, including with the data ingestion towards Elastic.

Hmm... well, I have no explanation other than a possible copy/paste error on that index name!