Machine Learning Watcher does not send emails

Hello

I have a very weird case. After I configure a job for Anomaly detection and set up real-time watcher alerts, the system does not generate emails when an anomaly occurs (no matter whether I configure it to send emails for Critical or Warning severity). I also tried to Simulate the alert in the Management -> Watcher section, but the email never came.

On the other hand, if I configure a Threshold alert, it successfully sends emails.

What may be the reason for this?

I am using Kibana version 6.7.1.

I think you need to run through some additional debugging steps to see what's going on.

Take your watch syntax, wrap it in a watch {} block, and use the _execute API call to see what happens.
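For example - a minimal sketch, assuming the 6.x endpoint under the _xpack prefix; replace the ... placeholders with the corresponding sections of your own watch:

POST _xpack/watcher/watch/_execute
{
  "watch": {
    "trigger": { ... },
    "input": { ... },
    "condition": { ... },
    "actions": { ... }
  }
}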

Then, on the right-hand side of the Dev Tools console, you can see the information from the watch's execution.

My suspicion is not that "email is not working", but rather that something is amiss in the condition part of your watch - it is not returning true, and thus the action is never executed.

Hello,

It stated that the syntax was invalid when I wrapped it as suggested above, so the watch could not be simulated that way. But after simulating the original watch, I got this output for the condition:

"condition": {
  "type": "compare",
  "status": "success",
  "met": false,
  "compare": {
    "resolved_values": {
      "ctx.load.aggregations.bucket_results.doc_count": 0

The condition is always false, no matter what I do.

Correct - so the action will never fire. You should inspect what the input section is querying (to see if it makes sense) and look at the condition block to see if there's any incorrect logic in it.

If you want, you can post those sections of your Watch here and I can take a look....

Hello, you can find the whole config below:

  "trigger": {
    "schedule": {
      "interval": "93s"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".ml-anomalies-*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "term": {
                    "job_id": "my-job-id"
                  }
                },
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-30m"
                    }
                  }
                },
                {
                  "terms": {
                    "result_type": [
                      "bucket",
                      "record",
                      "influencer"
                    ]
                  }
                }
              ]
            }
          },
          "aggs": {
            "bucket_results": {
              "filter": {
                "range": {
                  "anomaly_score": {
                    "gte": 10
                  }
                }
              },
              "aggs": {
                "top_bucket_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "anomaly_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "job_id",
                        "result_type",
                        "timestamp",
                        "anomaly_score",
                        "is_interim"
                      ]
                    },
                    "size": 1,
                    "script_fields": {
                      "start": {
                        "script": {
                          "lang": "painless",
                          "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].date.getMillis()-((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
                          "params": {
                            "padding": 10
                          }
                        }
                      },
                      "end": {
                        "script": {
                          "lang": "painless",
                          "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].date.getMillis()+((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
                          "params": {
                            "padding": 10
                          }
                        }
                      },
                      "timestamp_epoch": {
                        "script": {
                          "lang": "painless",
                          "source": "doc[\"timestamp\"].date.getMillis()/1000"
                        }
                      },
                      "timestamp_iso8601": {
                        "script": {
                          "lang": "painless",
                          "source": "doc[\"timestamp\"].date"
                        }
                      },
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"anomaly_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            },
            "influencer_results": {
              "filter": {
                "range": {
                  "influencer_score": {
                    "gte": 3
                  }
                }
              },
              "aggs": {
                "top_influencer_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "influencer_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "result_type",
                        "timestamp",
                        "influencer_field_name",
                        "influencer_field_value",
                        "influencer_score",
                        "isInterim"
                      ]
                    },
                    "size": 3,
                    "script_fields": {
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"influencer_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            },
            "record_results": {
              "filter": {
                "range": {
                  "record_score": {
                    "gte": 3
                  }
                }
              },
              "aggs": {
                "top_record_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "record_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "result_type",
                        "timestamp",
                        "record_score",
                        "is_interim",
                        "function",
                        "field_name",
                        "by_field_value",
                        "over_field_value",
                        "partition_field_value"
                      ]
                    },
                    "size": 3,
                    "script_fields": {
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"record_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.load.aggregations.bucket_results.doc_count": {
        "gt": 1
      }
    }
  },
  "actions": {
    "log": {
      "logging": {
        "level": "info",
        "text": "some info"
      }
    },
    "send_email": {
      "throttle_period_in_millis": 900000,
      "email": {
        "profile": "standard",
        "to": [
          "mail1@mail.mail"    
        ],
        "subject": "Alert",
        "body": {
          "html": "<html>\n  <body>\n  some text </body>\n</html>\n"
        }
      }
    }
  }
}

Thanks for that - this looks like the standard Watch that ML creates for you from the UI (as opposed to a custom-built Watch), so that's good to have verified.

The next obvious thing to consider is that you may not actually have any bucket anomalies with a score of 10 or more for your ML job in the last 30 minutes (which is what your Watch is looking for).

You can validate this simply by running this standard query all on its own:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "now-30m" } } },
        { "term": { "result_type": "bucket" } },
        { "term": { "job_id": "my-job-id" } },
        { "range": { "anomaly_score": { "gte": "10" } } }
      ]
    }
  }
}

You should see no results...

You can then sanity check whether there are any anomalies for that job at all by searching further back in time (like now-14d) - but you obviously won't want to make that edit permanent in your watch, because that defeats the purpose of only looking "recently" for newly emerging anomalies.
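For example, the same query widened to two weeks (run it ad hoc in Dev Tools rather than editing the watch; my-job-id is the same placeholder job name as above):

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "now-14d" } } },
        { "term": { "result_type": "bucket" } },
        { "term": { "job_id": "my-job-id" } },
        { "range": { "anomaly_score": { "gte": "10" } } }
      ]
    }
  }
}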

Thank you for the explanation!

The problem is that yesterday ML detected a Critical anomaly, but no email was sent. What could be the explanation for that?

Not entirely sure - but I can think of a few possibilities:

  1. The email send was attempted, but something (like a temporary network connectivity issue) prevented it from going through. You would likely see evidence of such a problem as error messages in the elasticsearch.log file around that time.

  2. The ML job was not running "in sync" with the Watch. They are two separate processes. The ML job runs on its own schedule and writes docs to .ml-anomalies-* when anomalies are found, with a timestamp equal to the leading edge of the bucket_span. So, if the bucket_span is 15m, then a little past 12:00pm noon an actively running ML job could write a doc into .ml-anomalies-* with a timestamp of 11:45am. There's a reason why the range of the watch is usually 2 times the width of the bucket_span: it is to avoid a situation in which the watch "misses" the publication of that document. That would happen if the look-back time of the watch were much shorter than now-30m in this case (I assume your ML job indeed has a bucket_span of 15m). However, if someone manually stopped the ML datafeed and started it again later, that could mess up the "sync" between the ML job and the Watch (see the query sketch after this list).

  3. The anomaly that you "think" got missed was actually a "record" anomaly, and not a "bucket" anomaly. If you don't know the difference, see https://www.elastic.co/blog/machine-learning-anomaly-scoring-elasticsearch-how-it-works
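For case 2, a quick way to sanity check the "sync" is to look at the timestamp of the newest bucket result the job has written and compare it against the watch's now-30m window. A sketch of such a query (same my-job-id placeholder as above):

GET .ml-anomalies-*/_search
{
  "size": 1,
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "my-job-id" } },
        { "term": { "result_type": "bucket" } }
      ]
    }
  }
}

If the newest bucket timestamp is much older than 30 minutes ago, the watch's now-30m window has nothing to match, which fits the out-of-sync scenario described above.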


Thank you for the useful information. I will check the cases you described and figure it out.
