Setting document limits for Machine Learning anomalies

As soon as you say "if its total document count is high enough to care", you're defining a rule. That's perfectly fine, but be aware that you'll have to manually define what "high enough" means.

The Custom Rules feature of the ML job allows you to override the definition of what is considered anomalous, but only with respect to the value measured by the detector function. So, if you are using the mean function, you can control whether an anomaly on the mean is "high enough" or "low enough". The same goes for the count function if you are measuring the event rate over time. What you cannot do is condition the rule on something other than the detector function's own output. In other words, you cannot suppress anomalies on the mean based on the count of documents.
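For reference, a custom rule that suppresses results when the measured value is too low looks roughly like this (the detector, field name, and the `100` threshold are illustrative placeholders, not values from your job):

```json
{
  "detectors": [
    {
      "function": "mean",
      "field_name": "responsetime",
      "custom_rules": [
        {
          "actions": [ "skip_result" ],
          "conditions": [
            {
              "applies_to": "actual",
              "operator": "lt",
              "value": 100
            }
          ]
        }
      ]
    }
  ]
}
```

Note that `applies_to` can only reference the detector's own values (`actual`, `typical`, `diff_from_typical`, or `time`) - there is no option for "the entity's document count", which is exactly the limitation described above.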

To accomplish this, you'd need to put that logic in the alerting layer instead. When creating a Watch, you would use a "chain input", which allows more than one search to define the alert. The first search would look in .ml-anomalies-* for the anomaly on the mean (and identify the offending entity via the influencer), and a second search would then determine the number of docs that entity has in that timeframe. The watch's condition is where you'd put the threshold that defines what "high enough" means for the doc count. If both criteria are met, the alert can notify you.
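A skeleton of such a watch might look like the following. Everything specific here is an assumption for illustration - the `job_id`, index names, field names, score threshold, time window, and the `1000` doc-count cutoff would all need to match your own setup:

```json
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "chain": {
      "inputs": [
        {
          "anomaly": {
            "search": {
              "request": {
                "indices": [ ".ml-anomalies-*" ],
                "body": {
                  "size": 1,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "job_id": "my_mean_job" } },
                        { "term": { "result_type": "record" } },
                        { "range": { "record_score": { "gte": 75 } } },
                        { "range": { "timestamp": { "gte": "now-15m" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        },
        {
          "doc_count": {
            "search": {
              "request": {
                "indices": [ "my-data-*" ],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "entity.keyword": "{{ctx.payload.anomaly.hits.hits.0._source.partition_field_value}}" } },
                        { "range": { "@timestamp": { "gte": "now-15m" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.anomaly.hits.total > 0 && ctx.payload.doc_count.hits.total >= 1000"
    }
  }
}
```

The second input uses mustache templating to pull the offending entity out of the first search's payload; depending on your Elasticsearch version, `hits.total` may be an object, in which case the condition script should read `hits.total.value` instead.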

An example of a chain input watch is here: "A watch alert example based on two different searches using CHAIN input and Painless script condition"

Another example is here: https://gist.github.com/richcollier/7e5603c366b9fcece6f1a8b1b3cf4d3f