Setting document limits for Machine Learning anomalies

I've successfully created some Population machine learning jobs but I'm seeing a lot of false positive anomalies generated from members of the population with too small a sample size for those metrics to converge to anything meaningful.

For instance, suppose an average value of 10 over 100 documents is an anomaly, but one document with a value of 10 isn't, even though the average is the same. I want a member of the population returned as an anomaly not only if its aggregate metric is unusually high/low, but also only if its total document count in the bucket span is high enough to care about.

Making the bucket span longer isn't an option.

Is there an easy way of limiting the ML job to a minimum document count before a bucket "counts"?

As soon as you say "if its total document count is high enough to care", you're defining a rule. That's perfectly fine, but be aware that you'll have to manually define what "high enough" means.

The Custom Rules part of the ML job allows you to override the definition of what is considered anomalous, but only in terms of what the detector function measures. So, if you are using the mean function, you can control whether an anomaly on the mean is "high enough" or "low enough"; the same goes for the count function if you are measuring the event rate as a function of time. But you cannot condition on an aspect that is not the detector function. In other words, you cannot suppress anomalies on the mean depending on the count of documents.
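To illustrate the limitation, here is a sketch of a detector with a custom rule (field names, thresholds, and the job setup are hypothetical). The rule's conditions can only refer to `actual`, `typical`, `diff_from_typical`, or `time`, so you can suppress results where the mean itself is small, but there is no condition type for the bucket's document count:

```json
{
  "function": "mean",
  "field_name": "responsetime",
  "over_field_name": "clientip",
  "custom_rules": [
    {
      "actions": [ "skip_result" ],
      "conditions": [
        {
          "applies_to": "actual",
          "operator": "lt",
          "value": 50
        }
      ]
    }
  ]
}
```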

In order to accomplish this, you'd need to put that logic in the alerting. When creating a Watch, you would use a "chain input", which allows more than one search to define the alert. The first search would look in .ml-anomalies-* for the anomaly on the mean (and locate the offending entity as the influencer), and a second search would determine the number of docs that this entity has in that timeframe. The condition in the watch is where you'd put the rule/threshold defining what "high enough" means for the doc count. If both conditions are met, the alert can notify you.
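A minimal sketch of such a chain-input watch, assuming a hypothetical job `my_population_job`, source index `my-source-index`, entity field `clientip`, and thresholds of 75 for record score and 100 for doc count (all of which you would adapt to your setup):

```json
{
  "trigger": { "schedule": { "interval": "15m" } },
  "input": {
    "chain": {
      "inputs": [
        {
          "anomalies": {
            "search": {
              "request": {
                "indices": [ ".ml-anomalies-*" ],
                "body": {
                  "size": 1,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "job_id": "my_population_job" } },
                        { "term": { "result_type": "record" } },
                        { "range": { "record_score": { "gte": 75 } } },
                        { "range": { "timestamp": { "gte": "now-30m" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        },
        {
          "doc_count_check": {
            "search": {
              "request": {
                "indices": [ "my-source-index" ],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "clientip": "{{ctx.payload.anomalies.hits.hits.0._source.over_field_value}}" } },
                        { "range": { "@timestamp": { "gte": "now-30m" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.anomalies.hits.total > 0 && ctx.payload.doc_count_check.hits.total >= 100"
    }
  }
}
```

The second input uses a mustache template to pull the offending entity out of the first input's payload, and the Painless condition only fires the actions when both the anomaly exists and the entity's doc count clears the threshold.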

An example of a chain input watch is here: "A watch alert example based on two different searches using CHAIN input and Painless script condition"


Thanks for confirming this, and for the idea to use Watcher as a workaround, but I feel you should be able to do this in the job config itself.

My datafeed already specifies a "summary_count_field_name" value of "doc_count" which means the datafeed knows the total number of documents in any given bucket despite what the detector metric happens to be.
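For context, this is roughly what that part of the job config looks like (a sketch with hypothetical field names): `summary_count_field_name` lives in the job's `analysis_config`, and when the datafeed uses a `date_histogram` aggregation, the `doc_count` of each bucket is what gets fed into it:

```json
{
  "analysis_config": {
    "bucket_span": "15m",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime",
        "over_field_name": "clientip"
      }
    ]
  }
}
```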

I can see why this would get problematic if you wanted to control one metric you're calculating with another you're not, but every aggregation bucket already returns a total doc count alongside the aggregated value.

There's no way to use this somehow? The only workaround is to have the anomalies (which I know aren't anomalies) reported incorrectly, and then configure Watcher to ignore them?

You could create a derived value using a script field or a bucket_script aggregation that combines the document count and the metric (i.e. doc count * metric value), and then have ML model that value over time.
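A sketch of the bucket_script variant, with hypothetical field and aggregation names; `_count` in the `buckets_path` refers to each terms bucket's document count, so the derived value scales the average by how many docs contributed to it:

```json
{
  "aggregations": {
    "clientip": {
      "terms": { "field": "clientip" },
      "aggregations": {
        "avg_resp": { "avg": { "field": "responsetime" } },
        "weighted_resp": {
          "bucket_script": {
            "buckets_path": {
              "count": "_count",
              "metric": "avg_resp"
            },
            "script": "params.count * params.metric"
          }
        }
      }
    }
  }
}
```

One caveat with this approach: the modeled quantity is no longer the raw mean, so anomalies on it reflect a blend of volume and value rather than either one alone.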
