Anomaly detection score details



PFA the image showing 2 anomalies. The first has a score of 55 but is described as 2 times higher, and the second has a score of 29 but is described as more than 3 times higher. The question is: why doesn't the anomaly that is 3 times higher have the highest score? Can someone explain the logic used here? I think a probability distribution is used to find the anomalies, but the details are not available.

In this case, the explanation is fairly simple. The time series model is learned online, so at each time period anomalousness is assessed based only on what came before. With so little time to learn, we won't assign very high scores. By the time of the second anomaly we've seen slightly more data and are a bit more confident.

That aside, as you observe, the anomaly score is derived from the predicted distribution for the value, specifically the chance of seeing something more extreme (where "more extreme" here means less likely, i.e. lower probability density). This isn't simply a function of how many times bigger the value is than typical, for some notion of typical.
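To make "the chance of seeing something less likely" concrete, here is a minimal sketch. It is not the product's actual implementation: the log-normal density, its parameters and the helper `prob_less_likely` are all assumptions chosen for illustration. It sums the probability mass of every point whose density is no higher than the density at the observed value.

```python
import math

def lognormal_pdf(x, mu=0.0, sigma=0.5):
    """Density of a log-normal distribution (illustrative parameters)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def prob_less_likely(pdf, x, lo, hi, steps=200_000):
    """Approximate P(pdf(X) <= pdf(x)) with a midpoint Riemann sum on [lo, hi]."""
    threshold = pdf(x)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mid = lo + (i + 0.5) * width
        density = pdf(mid)
        if density <= threshold:
            total += density * width
    return total

# The mass comes from BOTH tails: large values and very small values
# whose density is below the density at the observed value.
p_at_3 = prob_less_likely(lognormal_pdf, 3.0, 0.0, 50.0)
p_at_4 = prob_less_likely(lognormal_pdf, 4.0, 0.0, 50.0)
print(f"P(less likely than 3.0) ~ {p_at_3:.4f}")
print(f"P(less likely than 4.0) ~ {p_at_4:.4f}")
```

The result depends on the whole shape of the density, not just on the ratio between the observed value and a typical one.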

Let's take some real examples.

People's heights tend to follow a normal distribution (up to a point). I just looked it up and apparently female height in the US has a mean of 5'4'' and a standard deviation of 2.8''. Under that model, the chance of bumping into a woman 6'6'' tall is on the order of 1 in 5,000,000. Note that 6'6'' is only about 20% taller than the typical 5'4''.
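As a quick check of the arithmetic, here is a small sketch that evaluates the upper tail of a normal distribution with the mean and standard deviation quoted above. The function name `normal_upper_tail` is just an illustrative helper. With these parameters 6'6'' sits five standard deviations above the mean, and the tail probability comes out at a few in ten million, the same order of magnitude as the figure quoted.

```python
import math

def normal_upper_tail(x, mean, std):
    """P(X > x) for X ~ Normal(mean, std), via the complementary error function."""
    z = (x - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

mean_in = 5 * 12 + 4   # 5'4'' in inches
std_in = 2.8           # standard deviation in inches
tall_in = 6 * 12 + 6   # 6'6'' in inches

p = normal_upper_tail(tall_in, mean_in, std_in)
ratio = tall_in / mean_in

print(f"z-score: {(tall_in - mean_in) / std_in:.1f}")
print(f"tail probability: {p:.2e} (about 1 in {1 / p:,.0f})")
print(f"height ratio to typical: {ratio:.2f}")
```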

By contrast, many processes follow what's called a power law. Apparently, the diameter of Martian dust devils follows a power law up to a point, although there is some debate about this, and it is difficult to get distribution data online. So a much less fun example is income, which definitely follows a power law, and for which it is easy to get distribution data. The mean household income in the US in 2021 was around $96k, but the chance of meeting someone in a household with an income of more than $500k (roughly 5 times typical) is just 1 in 100.

So for the first distribution you get a 1 in 5,000,000 chance of seeing a value 20% larger than typical, and for the second you get a 1 in 100 chance of seeing a value roughly 5 times typical. What this underlines is that learning the right distribution is really important for assessing odds. You also typically need quite a lot of data to get a good sense of how the tails of a distribution behave.
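The contrast can be sketched numerically. The Pareto parameters below (`X_MIN`, `ALPHA`) are made up for illustration and are not fitted to income data; the normal parameters are the height figures from above. For each multiple of the typical value (taken here as the median), the code prints the upper-tail probability under both models.

```python
import math

def normal_upper_tail(x, mean, std):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))

def pareto_upper_tail(x, x_min, alpha):
    """P(X > x) for a Pareto (power-law) distribution with scale x_min and shape alpha."""
    return 1.0 if x <= x_min else (x_min / x) ** alpha

NORMAL_MEAN, NORMAL_STD = 64.0, 2.8   # heights example, in inches
X_MIN, ALPHA = 1.0, 2.0               # hypothetical power-law parameters

# For a normal distribution the median equals the mean.
pareto_median = X_MIN * 2 ** (1 / ALPHA)

for k in (1.2, 2.0, 3.0, 5.0):
    p_norm = normal_upper_tail(k * NORMAL_MEAN, NORMAL_MEAN, NORMAL_STD)
    p_par = pareto_upper_tail(k * pareto_median, X_MIN, ALPHA)
    print(f"{k:.1f}x typical: normal {p_norm:.3g}, power law {p_par:.3g}")
```

Under these assumptions a value 5 times typical is still a 1 in 50 event for the power law, while a value only 1.2 times typical is already a few-in-a-million event for the normal.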

We update our belief about what sort of distribution the values follow as we see more data, so different buckets are not necessarily scored using the same distribution model. This kind of online update is very important to make everything scale to really large data sets. And, anyway, for many data sets behaviour can change over time, so you need to reassess your model. The upshot is that reasoning like "this value is 2X bigger than typical and this other value is 3X bigger than typical, so its score should be higher" doesn't really work (and shouldn't).
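Here is a toy sketch of why the model in force at each point in time matters. It is a deliberate simplification and not how the scoring is actually implemented: a running normal fit (Welford's algorithm) stands in for the online model, the mean stands in for the typical value, and both data streams are invented. A value 2X typical is scored against the early model, and a value 3X typical against the later one, after the data have become more spread out.

```python
import math

class RunningNormal:
    """Welford's online mean/variance, a toy stand-in for an online model."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def upper_tail(self, x):
        """P(X > x) under a normal fitted to the data seen so far (needs n >= 2)."""
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.5 * math.erfc((x - self.mean) / (std * math.sqrt(2)))

early = [10, 12, 8, 11, 9, 10]    # invented data: tightly clustered
later = [4, 18, 2, 20, 6, 16]     # invented data: behaviour becomes noisier

model = RunningNormal()
for x in early:
    model.update(x)
p_2x_early = model.upper_tail(2 * model.mean)

for x in later:
    model.update(x)
p_3x_later = model.upper_tail(3 * model.mean)

print(f"2X typical, early model: {p_2x_early:.3g}")
print(f"3X typical, later model: {p_3x_later:.3g}")
```

With these made-up numbers the 3X value is far less surprising than the 2X value, simply because the two were judged against different learned distributions.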

(As an aside, our typical value is the median of the predicted distribution, not its mean.)