ML alerts triggering on interim result

alerting
machine-learning

(Ben Polzin) #1

After upgrading from 6.3 to 6.5.0, I've been getting Watcher alerts from ML jobs that are false positives. If I click the alert link quickly enough, I see 0 hits where there should be some large number, and the result is marked "Interim result". If I wait a few minutes and refresh, the anomaly score drops from 99 to <1 once the job has found more than 0 hits.

Is this an indication of a performance issue on my ES cluster, or of some changed behavior in the ES / ML / Watcher interaction in 6.5?


(rich collier) #2

Hello - I don't believe there were any relevant changes with respect to interim_result or interactions between ML and Alerting between v6.3 and v6.5.

In general, we shouldn't be creating "interim results" when the actual value is less than the expected value; we only do that when the actual is more than what we've been expecting.

Can you possibly run the following in the Dev Tools Console:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "now-3m" } } },
        { "term": { "job_id": "yourjobname" } },
        { "term": { "result_type": "bucket" } },
        { "term": { "is_interim": "true" } }
      ]
    }
  }
}

(Adjust the job_id, of course.)

Try to catch the situation while it's happening. I'd like to see the output, so please paste it here.


(Ben Polzin) #3

Thanks, Rich. Here's the output:

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 70,
    "successful" : 70,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "doc",
        "_id" : "fusion-prem-volume_bucket_1543527120000_60",
        "_score" : 0.0,
        "_source" : {
          "job_id" : "fusion-prem-volume",
          "timestamp" : 1543527120000,
          "anomaly_score" : 74.31915433005682,
          "bucket_span" : 60,
          "initial_anomaly_score" : 74.31915433005682,
          "event_count" : 0,
          "is_interim" : true,
          "bucket_influencers" : [
            {
              "job_id" : "fusion-prem-volume",
              "result_type" : "bucket_influencer",
              "influencer_field_name" : "bucket_time",
              "initial_anomaly_score" : 74.31915433005682,
              "anomaly_score" : 74.31915433005682,
              "raw_anomaly_score" : 25.807843120201568,
              "probability" : 5.140488806107606E-28,
              "timestamp" : 1543527120000,
              "bucket_span" : 60,
              "is_interim" : true
            }
          ],
          "processing_time_ms" : 1,
          "result_type" : "bucket"
        }
      }
    ]
  }
}

(rich collier) #4

Great, and now run this:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "timestamp": 1543527120000 } },
        { "term": { "job_id": "fusion-prem-volume" } },
        { "term": { "result_type": "bucket" } }
      ]
    }
  }
}

(rich collier) #5

By the way, I cannot seem to replicate your situation at the moment. Until we figure out what's going on, you could work around it by modifying the Watch to ignore interim results. Just add:

        { "term": { "is_interim": "false" } }

to the query the Watch is making.
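
For reference, here's a rough sketch of what that could look like once the filter is in place. This mirrors the diagnostic query above rather than your actual Watch body, and the job_id, anomaly_score threshold, and time range are placeholders you'd adjust to match whatever your Watch currently uses:

# Sketch only: job_id, score threshold, and time range are placeholders
GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "now-5m" } } },
        { "term": { "job_id": "yourjobname" } },
        { "term": { "result_type": "bucket" } },
        { "range": { "anomaly_score": { "gte": 75 } } },
        { "term": { "is_interim": "false" } }
      ]
    }
  }
}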


(Ben Polzin) #6

Here's the result from the last query. Apologies if this was time-sensitive; I didn't see your reply right away.

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 70,
    "successful" : 70,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : ".ml-anomalies-shared",
        "_type" : "doc",
        "_id" : "fusion-prem-volume_bucket_1543527120000_60",
        "_score" : 0.0,
        "_source" : {
          "job_id" : "fusion-prem-volume",
          "timestamp" : 1543527120000,
          "anomaly_score" : 0.0,
          "bucket_span" : 60,
          "initial_anomaly_score" : 0.0,
          "event_count" : 44522,
          "is_interim" : false,
          "bucket_influencers" : [ ],
          "processing_time_ms" : 0,
          "result_type" : "bucket"
        }
      }
    ]
  }
}

For now this isn't too painful, so I'd like to keep working through it to find the root cause, but thank you for the suggested workaround.


(rich collier) #7

I think I've reproduced your situation and will post an update soon. Hang tight.


(rich collier) #8

We have indeed reproduced the situation and have opened the following issue to address it:

Thanks a bunch for reporting!