Single metric job not real time, not reporting missed ingestions

v7.1.0

I am trying to create a simple single metric job to monitor the total number of documents ingested daily into a specific index. I took the following steps in Kibana to create what I thought would be an appropriate job (a sketch of roughly what I believe the resulting job looks like follows the list):

  • select index I want to monitor
  • create single metric job
  • select count as the aggregation and 1d as the bucket span (I'm using a "date" field for time, which is only accurate to the day)
  • set the date range to: beginning of data set - now
  • select "continue job in real time"
  • create watch for a job
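
For reference, I believe the job this wizard creates corresponds roughly to the config below (just a sketch from my side: daily_doc_count is a placeholder job name, and I've left out the datafeed that the wizard also sets up):

PUT _ml/anomaly_detectors/daily_doc_count
{
  "description": "Daily document count on my index",
  "analysis_config": {
    "bucket_span": "1d",
    "detectors": [
      {
        "function": "count",
        "detector_description": "count"
      }
    ]
  },
  "data_description": {
    "time_field": "date"
  }
}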

When I start this job, it catches previous ingestion failures that occurred and were fixed before this monitor was created: it creates anomaly warnings at the points in the data where the document count did not meet what was expected.

However, I recently had another ingestion failure that this monitor did not catch. Instead of reporting 0 documents for the days in which the ingestion failed, and subsequently creating anomaly warnings, the graph simply stops:

The date range is also stuck at the last detected timestamp:

If I manually set the date range to now, I can see the expected anomaly warnings. The only issue is that the monitor keeps reverting to the last detected timestamp as the stopping point, even after setting and updating the date range:


Am I missing something? Is there a way I can prevent the monitor from stopping at the last detected timestamp and instead have it report days on which no documents are ingested (and hence the latest timestamp is not updated)?

Hi - I'm a little confused about what you're describing here, mainly because of some of your word choices (for example, we don't use the term "monitor").

So, perhaps I can ask a few clarifying questions to get at what's going on:

  • I assume your datafeed for this job is ongoing (currently in the "running" state) and that you did nothing to alter that state between screenshots 1 and 2?
  • What were the timestamps of the two orange anomalies, and do they match up with the latest_record_timestamp of the job?
  • Are you going into ML's Single Metric Viewer from the link on the jobs page, navigating directly, or clicking the link from the Watch?
  • Is this just a display problem (in your mind) or did the alert from the Watch not work either?

Sorry about the inconsistent language; I will try to clarify as best I can. At a high level, I am trying to create a single metric ML job that analyzes the document count of an index each day and marks any anomalous data.

As for your questions:

Yes, I did not stop or alter the state of this job between the two screenshots. All I did was change the date range, as shown in the images in my original post.

2019-06-22 00:00:00 was the latest_record_timestamp of the job. The two orange anomalies are timestamped 2019-06-23 00:00:00 and 2019-06-24 00:00:00, respectively.

I am navigating to the Single Metric Viewer from the Job Management tab of the Machine Learning page, using the graph icon in the Actions column for that job.

I did not get an alert, and I doubt that it is purely a display problem, given that I am not getting any anomaly warnings in the Anomaly Explorer tab either.

Thank you for your response.

Hi,

I think the problem you are reporting here on opening the Single Metric Viewer (which would also affect the Anomaly Explorer) is that the end of the time range set when opening from the links on the Job Management page uses the latest_record_timestamp from the job, which is the timestamp of the most recent piece of data received.

If the job continues to run but has stopped receiving data, due to an ingestion failure for example, you will continue to get anomalies because the count drops to zero. The timestamps of these anomalies will be after the latest_record_timestamp used as the end of the time range when opening the view, which would explain why the two anomalies only show up when you manually change the end date to 'now'.

Would this explain what you're seeing? If so, we need to make a code change to set the end date on opening to max(latest_record_timestamp, latest_bucket_time + bucket_span - 1ms). If you can confirm that there was an ingestion failure after 2019-06-22 00:00:00, I will raise a GitHub issue to get this fixed.
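
(As an aside, if you want to confirm which timestamp the links are picking up, the job stats API reports it under data_counts.latest_record_timestamp; a minimal sketch, with <job_id> as a placeholder for your job's ID:)

GET _ml/anomaly_detectors/<job_id>/_stats

With a 1d bucket span, the proposed max(latest_record_timestamp, latest_bucket_time + bucket_span - 1ms) end date would reach to the end of the latest result bucket, so in your case the 2019-06-23 and 2019-06-24 buckets would fall inside the default range even though no data arrived for them.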

I'm not sure, though, that this would explain why you are not getting alerts from the Watch you set up on this job.

Great, thanks Pete

Samg, this still does not explain why you didn't get an alert (or so you say). The anomaly records are in the index (because they do show up in the UI once the date range is set correctly), so there is little reason why those records would be missed by the Watch.

If you want to debug that aspect of things, we will need to see the code from your Watch.

Yes, this is what I believe is happening. I have confirmed with a colleague that there was an ingestion failure resulting in 2019-06-22 00:00:00 being the last recorded timestamp.

These might be separate issues; I will upload the JSON from my Watch momentarily for debugging purposes.

Thank you for your time.

we will need to see the code from your Watch.

Please find the Watch below, with personal information redacted from my email address and the email body removed from the end to meet the character limit.
Thank you.

{
  "trigger": {
    "schedule": {
      "interval": "70s"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".ml-anomalies-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "term": {
                    "job_id": "testing_reindex_anom"
                  }
                },
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-2880m"
                    }
                  }
                },
                {
                  "terms": {
                    "result_type": [
                      "bucket",
                      "record",
                      "influencer"
                    ]
                  }
                }
              ]
            }
          },
          "aggs": {
            "bucket_results": {
              "filter": {
                "range": {
                  "anomaly_score": {
                    "gte": 75
                  }
                }
              },
              "aggs": {
                "top_bucket_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "anomaly_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "job_id",
                        "result_type",
                        "timestamp",
                        "anomaly_score",
                        "is_interim"
                      ]
                    },
                    "size": 1,
                    "script_fields": {
                      "start": {
                        "script": {
                          "lang": "painless",
                          "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].date.getMillis()-((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
                          "params": {
                            "padding": 10
                          }
                        }
                      },
                      "end": {
                        "script": {
                          "lang": "painless",
                          "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].date.getMillis()+((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
                          "params": {
                            "padding": 10
                          }
                        }
                      },
                      "timestamp_epoch": {
                        "script": {
                          "lang": "painless",
                          "source": "doc[\"timestamp\"].date.getMillis()/1000"
                        }
                      },
                      "timestamp_iso8601": {
                        "script": {
                          "lang": "painless",
                          "source": "doc[\"timestamp\"].date"
                        }
                      },
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"anomaly_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            },
            "influencer_results": {
              "filter": {
                "range": {
                  "influencer_score": {
                    "gte": 3
                  }
                }
              },
              "aggs": {
                "top_influencer_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "influencer_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "result_type",
                        "timestamp",
                        "influencer_field_name",
                        "influencer_field_value",
                        "influencer_score",
                        "isInterim"
                      ]
                    },
                    "size": 3,
                    "script_fields": {
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"influencer_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            },
            "record_results": {
              "filter": {
                "range": {
                  "record_score": {
                    "gte": 3
                  }
                }
              },
              "aggs": {
                "top_record_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "record_score": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [ "result_type","timestamp","record_score","is_interim","function",
"field_name","by_field_value","over_field_value","partition_field_value"]
                    },
                    "size": 3,
                    "script_fields": {
                      "score": {
                        "script": {
                          "lang": "painless",
                          "source": "Math.round(doc[\"record_score\"].value)"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.aggregations.bucket_results.doc_count": {
        "gt": 0
      }
    }
  },
  "actions": {
    "log": {
      "logging": {
        "level": "info",
        "text": removed
      }
    },
    "send_email": {
      "throttle_period_in_millis": 900000,
      "email": {
        "profile": "standard",
        "to": ["sam.g@.com"],
        "subject": removed,
      }
    }
  }
}

FYI, I've raised a GitHub issue for setting the time range on opening the Single Metric Viewer: https://github.com/elastic/kibana/issues/39770

Thank you. I can now see that the reason you didn't get an alert is simply that your Watch is only set to alert on red anomalies (anomaly_score of 75 or above), and the anomalies that were shown were in the orange range (50 to 75).

Ah, that makes sense, thank you for helping me debug this.

So going forward, with these changes, can I expect that even if no anomalies are showing in the Kibana Single Metric Viewer UI, I will still get an email alert if there is anomalous data?

Yes, as long as the anomalies match the condition you're searching for (in this case, an anomaly_score of 75 or above).
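
For example (just a sketch of one possible tweak), lowering the filter on the bucket_results aggregation in your Watch from 75 to 50 would let orange anomalies like the ones above satisfy the condition, since the condition only checks that this aggregation matched at least one result:

              "filter": {
                "range": {
                  "anomaly_score": {
                    "gte": 50
                  }
                }
              },

The rest of the Watch stays exactly as it is.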
