I have two jobs (each with two detectors using the max and high_mean functions) working on the same data, with exclude frequent enabled for one of them. I have seen some high severity anomalies raised when exclude frequent is "none", but nothing is detected with "all". So I'm wondering how these ML algorithms define frequent entities and change the scores. Can we expect those frequent entities to eventually be ignored by the first model (exclude frequent "none") later in the data processing?
@saraKM, is the job using a by field or an over field for handling entities?
It could be that the only anomalous entities are the ones that occur frequently. If entities correlate strongly with the detector metric, it might be that only very frequent entities have anomalous high_mean values.
Additionally, the introduction of Filter lists basically obviates the need for the exclude_frequent setting, which pre-dated Filters. With Filters, you have specific control over which entities you'd like to omit from anomaly creation.
Although you can get fine-grained control over the exact field values you exclude with Filter lists, and this may often be the right choice, they do require some manual configuration and ongoing maintenance. There is also nothing to stop you using both a Filter list and exclude frequent. Note that setting exclude frequent to "none" simply means no values are excluded.
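As a rough sketch (the filter id, job name, field names, and values below are made up for illustration), combining a Filter list with a detector custom rule to skip results for specific entities would look something like this:

```
PUT _ml/filters/known_noisy_entities
{
  "description": "Entities we never want anomaly results for",
  "items": ["service-a", "service-b"]
}

PUT _ml/anomaly_detectors/example_job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "response_time",
        "by_field_name": "entity",
        "custom_rules": [
          {
            "actions": ["skip_result"],
            "scope": {
              "entity": {
                "filter_id": "known_noisy_entities",
                "filter_type": "include"
              }
            }
          }
        ]
      }
    ],
    "influencers": ["entity"]
  },
  "data_description": { "time_field": "@timestamp" }
}
```

With filter_type "include", the rule applies to entities that appear in the filter list, so skip_result suppresses anomaly records for exactly those values while the rest of the entities are modelled as usual, and the list itself can be maintained separately via the filters API.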
Exclude frequent is most likely to be useful in contexts where you know that frequently occurring events are not of interest. In this context, frequent means the entity generates values in a significant fraction of time buckets, so whether a field's values are excluded depends on the job's bucket span.
Assuming exclude frequent fits your needs, I would recommend it mainly in conjunction with a population analysis. For example, you might want to look for unusually high values of x for each entity, but ignore entities which are always active in the system you're observing.
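For instance (names purely illustrative), a population job that ignores entities which are frequent across the population could be configured roughly like this, with exclude_frequent set to "over" so that frequently occurring over-field values are dropped:

```
PUT _ml/anomaly_detectors/population_example
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "x",
        "over_field_name": "entity",
        "exclude_frequent": "over"
      }
    ],
    "influencers": ["entity"]
  },
  "data_description": { "time_field": "@timestamp" }
}
```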
Hi Ben, both detectors use "by" and "partition" fields (low cardinality, fewer than 10 values) and the bucket span is 15 minutes. I agree that they might all be frequent, but I was confused about why they are detected as high severity if they happen frequently. As @richcollier and @Tom_Veasey suggested, I might go with a mixed model approach.
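For reference, each detector in the "all" job currently looks roughly like this (field names anonymized; the sibling detector uses max instead of high_mean), and the second job is identical except that exclude_frequent is "none":

```
{
  "function": "high_mean",
  "field_name": "metric_value",
  "by_field_name": "entity",
  "partition_field_name": "group",
  "exclude_frequent": "all"
}
```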
Thanks for the suggestions. For the population analysis, do you mean having another detector for the population without exclude frequent (in conjunction with the two existing detectors with exclude enabled), or just having a population analysis with exclude frequent enabled? In the first case, does that help decrease the severity at the influencer level? I might be checking record-level results, which I assume are not impacted by the other detectors' results?