I have one job running in Elastic ML with the following config:
"detector_description": "high_non_zero_count by \"client.user_name\" partitionfield=\"client.ip\"",
Looking at the critical results, I found one client that shows a high count once a day and low values otherwise. For the first 9 occurrences of the high count no anomaly is raised, but for the 10th occurrence I get a high anomaly score (above 90). No anomaly is raised after day 10. All of these high counts occur around the same time each day and have similar values. So I am wondering why I get this high anomaly score on day 10 only, when nothing is raised on the days before or after (which is what I would expect, since this happens daily)?
This kind of question is really difficult to answer without a representative data set to look at. That said, this job configuration could be a very bad idea if the cardinality of client.ip is large, as you'd be expecting ML to individually model thousands (or even hundreds of thousands?) of unique combinations.
Maybe consider doing this via population analysis instead?
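To make the suggestion concrete, here is a rough sketch of how the detector from the question might be turned into a population detector. The field names `function`, `by_field_name`, `partition_field_name` and `over_field_name` are the usual detector properties in an anomaly detection job; the helper `to_population` is hypothetical, and treating client.ip as the population (and dropping the split on client.user_name) is only an assumption, since the thread does not say which field should play that role. The output is a fragment, not a complete job definition (it has no bucket span, for example).

```python
import json

# Detector as described in the question: one model per
# (client.ip, client.user_name) combination.
temporal_detector = {
    "function": "high_non_zero_count",
    "by_field_name": "client.user_name",
    "partition_field_name": "client.ip",
}


def to_population(detector, over_field):
    """Hypothetical helper: drop the by/partition splits and compare
    each value of `over_field` against the rest of the population."""
    population = {
        key: value
        for key, value in detector.items()
        if key not in ("by_field_name", "partition_field_name")
    }
    population["over_field_name"] = over_field
    return population


# Assumption: client.ip is the population being compared.
population_detector = to_population(temporal_detector, "client.ip")

# Fragment only; a real job needs more settings around this.
print(json.dumps({"detectors": [population_detector]}, indent=2))
```

With a population detector, each entity is compared against the behavior of its peers rather than against its own history, so ML no longer has to keep an individual model for every combination.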
Thanks for the response @richcollier. I agree and understand that this behavior largely depends on the dataset. I was just wondering whether the ML modeling algorithms in Elastic follow specific rules, for example using a certain number of inputs to create the first model and only then starting to raise anomalies for new incoming records, which might explain these results. I can provide a graph if that helps.
Regarding the cardinality, we had the same concern and had a population model. However, it looks like the temporal model performs better here, probably because the entities behave differently from one another. We are still investigating the results, though.
Elastic ML needs a minimum amount of data to be able to build an effective model for anomaly detection. Essentially, it's based on how quickly ML can get the first estimates of the various model parameters. For sampled metrics such as mean, min, max, and median, the minimum data amount is either eight non-empty bucket spans or two hours, whichever is greater. For all other non-zero/null metrics and count-based quantities, it's four non-empty bucket spans or two hours, whichever is greater. For the count and sum functions, empty buckets matter and therefore it is the same as sampled metrics (eight buckets or two hours). For the rare function, it'll typically be around 20 bucket spans. It can be faster for population models, but it depends on the number of people that interact per bucket.
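As a rough illustration of the arithmetic above, the rules can be written as a small function. This is only a sketch of the description in this reply, not code from Elastic: the function name `min_learning_time` and the two sets are made up for the example, and stripping the `high_`/`low_` prefix to find the base function is an assumption. The result is a lower bound, because the buckets have to be non-empty for most functions, so sparse data takes longer in wall-clock time.

```python
from datetime import timedelta

# Groupings taken from the description above; the names are illustrative.
SAMPLED_METRICS = {"mean", "min", "max", "median"}
EMPTY_BUCKETS_MATTER = {"count", "sum"}


def min_learning_time(function, bucket_span):
    """Approximate minimum span of data before a first model exists."""
    # Assumption: high_/low_ variants behave like their base function.
    for prefix in ("high_", "low_"):
        if function.startswith(prefix):
            function = function[len(prefix):]

    if function == "rare":
        # "Typically around 20 bucket spans", so treat this as a rough figure.
        return 20 * bucket_span

    if function in SAMPLED_METRICS or function in EMPTY_BUCKETS_MATTER:
        buckets = 8
    else:
        buckets = 4

    # "N bucket spans or two hours, whichever is greater."
    return max(buckets * bucket_span, timedelta(hours=2))


for fn in ("mean", "high_non_zero_count", "count", "rare"):
    print(fn, min_learning_time(fn, timedelta(minutes=15)))
```

With a 15-minute bucket span, both the eight-bucket and the four-bucket cases come out at the two-hour floor; with a one-hour bucket span they become eight and four hours respectively.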
But...anomalies "early" in the learning are generally less reliable than anomalies "late" in the learning. It is simply because the data is analyzed in chronological order. As an analogy - imagine meeting a dog for the first time and you notice that the dog is very quiet and sleeps a lot. You might think that's just how he is. However, if you've met the dog 100 times before, and usually he is very energetic, barks a lot, and loves to play - but now you notice he's very quiet and sleeps a lot - you might wonder if the dog is perhaps now sick in some way. Your assessment of the dog is more reliable with over 100 observations under your belt.
So, in general, anomalies that are raised once the modeling is more mature (perhaps 1-3 weeks into the learning) are often more reliable.