I guess I did not understand the use of aggregation in anomaly detection. I was expecting that the same aggregation with different time intervals would not affect the results; however, it seems that this is not the case.

I created 3 anomaly detection jobs using aggregation as discussed here, each with a bucket span of 6h.

For each job, I created a datafeed with a different fixed_interval: 1h, 3h, and 6h. Note that the bucket span is divisible by all of these intervals.

I was expecting the same results for each job. However, even though I get exactly the same time series chart for all jobs, i.e. the actual values of the buckets are the same, the anomaly scores differ. In some cases, a bucket is not considered an anomaly by one job while another job marks it as an anomaly.
Right, first of all, a little background on aggregation. If you group together metric values to form some sort of statistic (say the mean of those values), and if you assume that the individual values all come from the same distribution, then how the statistic is distributed usually depends on the count of values you have aggregated. You can test this out for yourself, for example by charting the mean of noisy data in Kibana with different aggregation intervals. What you typically see is that as you choose longer bucket lengths (i.e. more values are averaged together) the chart becomes smoother. (In fact, if everything is independent, the variance of the mean drops like 1 / "number of samples".)
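To make that last point concrete, here is a small self-contained simulation (nothing Elasticsearch-specific, just the statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=60_000)  # one noisy metric

# Bucket the same series with three different bucket lengths and compare
# the spread of the per-bucket means.
for n in (10, 100, 1000):
    bucket_means = values.reshape(-1, n).mean(axis=1)
    print(f"{n:>4} values per bucket -> variance of the means: {bucket_means.var():.5f}")

# Prints roughly 0.4, 0.04, 0.004: the variance of the mean falls like
# sigma^2 / n, which is why longer buckets give smoother charts.
```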
For our anomaly detection, when we try to model the distribution of, for example, the mean, what we ideally want (aside from time dependent behaviour like seasonality) is that the means come from the same distribution. However, if we fix the time interval we aggregate on, the count of values in each mean statistic will typically vary, and so, by construction, we break the assumption we make that they come from the same distribution.

To avoid this, rather than adding each time bucket mean to the model, we estimate how many values we get on average per time bucket interval and then always group together approximately this many values and compute their mean. When the rate is high we can add several samples per bucket; when the rate is low we wait until enough values have arrived. The finer grained the sub-bucketing in the pre-aggregated data, the more accurately we can achieve this. Indeed, if you just scroll the data rather than pre-aggregate, we always use identical sample counts for all statistic values we add to the model. So this choice alters the values the model actually sees and hence alters its predictions and the anomalies it generates.
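As a simplified sketch of the regrouping idea (this is a toy illustration, not the actual ML implementation): given pre-aggregated sub-buckets as (count, mean) pairs, keep combining them until roughly a target number of raw values is covered, then emit one statistic for the model. You can see from the logic why coarser sub-buckets make the emitted sample counts less uniform.

```python
def regroup(sub_buckets, target_count):
    """sub_buckets: iterable of (count, mean) pairs from pre-aggregation.

    Yields (count, mean) samples each covering roughly target_count raw
    values. Any trailing partial group is held back until enough values
    have arrived (here it is simply dropped at the end of the iterable).
    """
    acc_count, acc_sum = 0, 0.0
    for count, mean in sub_buckets:
        acc_count += count
        acc_sum += count * mean  # recover the sum so means combine exactly
        if acc_count >= target_count:
            yield acc_count, acc_sum / acc_count
            acc_count, acc_sum = 0, 0.0

# Fine-grained sub-buckets let the emitted counts hug the target (~20);
# with coarse sub-buckets each group overshoots by up to a whole sub-bucket.
fine = [(12, 5.0), (9, 5.5), (11, 4.8), (10, 5.2), (8, 5.1), (12, 4.9)]
print(list(regroup(fine, target_count=20)))
```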
We still assign anomaly scores to time buckets, and so the charts we display are all unchanged. However, when we then come to assess how unusual a time bucket is, we ask ourselves both what the value is and what the count of values in the bucket is (i.e. how we expect the distribution to change).
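To see why the count matters for scoring, here is a toy example using a plain normal model (again, not our actual scoring, just the underlying intuition): the same bucket mean becomes far more surprising when it aggregates more values, because the mean's spread shrinks with the count.

```python
import math

def mean_tail_prob(observed_mean, mu, sigma, n):
    """Two-sided tail probability of a bucket mean under N(mu, sigma^2 / n)."""
    z = abs(observed_mean - mu) / (sigma / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))

# A bucket mean of 10.6 against a typical value of 10.0 (sigma = 2.0):
for n in (10, 100, 1000):
    print(f"count={n:>4}: tail probability = {mean_tail_prob(10.6, 10.0, 2.0, n):.3g}")

# count=10 gives ~0.34 (unremarkable); count=1000 gives ~2e-21 (extreme).
```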
A couple more notes:
- This only affects some metrics, e.g. things like the mean and median. Things like the count and sum should produce identical results for pre-aggregated data.
- We typically recommend using around 1/10 of the bucket span for the pre-aggregation interval as a good tradeoff between performance and accuracy (see the sketch after these notes).
- It is on the roadmap (and indeed we have a prototype) to autogenerate aggregations for you in the datafeed API, which should deal with all of these issues automatically.
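For reference, applied to the 6h bucket span above, the 1/10 rule would give a fixed_interval of about 36m. A minimal sketch of such a datafeed body follows, written as a Python dict; the job, index, and field names ("my-job", "my-index", "value", "@timestamp") are placeholders to adapt to your own data, and the layout follows the usual datafeed aggregation pattern of a date_histogram wrapping a max aggregation on the time field plus the metric aggregation.

```python
# Hypothetical datafeed body for a job with a 6h bucket_span,
# pre-aggregating at ~1/10 of the bucket span. Names are placeholders.
datafeed_body = {
    "job_id": "my-job",
    "indices": ["my-index"],
    "aggregations": {
        "buckets": {
            "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "36m",  # 6h bucket span / 10
            },
            "aggregations": {
                # Max timestamp per sub-bucket, as the datafeed docs require.
                "@timestamp": {"max": {"field": "@timestamp"}},
                # The metric statistic the job analyses.
                "avg_value": {"avg": {"field": "value"}},
            },
        }
    },
}
```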