Variation in analysis results using different aggregation values

anomalyml · November 16, 2017, 10:23am

Hi,

I have a question about the aggregation options listed in the "Fields" section when creating a new multi-metric job.

As I understand it, the value selected from the fields list, say "Mean" or "Sum", determines the operation by which an anomaly is detected inside the specified "Bucket Span". Let's say my input data has a frequency of one minute and I select a bucket span of 10 minutes. The aggregation value then determines how an anomaly is detected within each 10-minute bucket of data.

Now, I tested a case with the same one-minute frequency data. Using this data I created two different analyses, both with the bucket span set to 1 minute, and with the aggregation function set to 'Mean' for one and 'Sum' for the other. I expected the results of both runs to match: since the bucket span matches the frequency of the data, we are essentially applying the aggregation function to a single data point, i.e. on a per-data-point basis. However, I see that the analysis results are different. Anomalies are detected at different points in the two runs.

Is this the expected behaviour?

Thanks

Tom_Veasey · November 16, 2017, 1:09pm

There are a few small but important differences between sum and mean (aside from the way the measurements are aggregated to generate features to model) which can cause these differences and are worth being aware of:

1. Our mean function is a sampled metric, as are min, max and median: we try to keep the number of measurements in each mean sample added to our model the same, to avoid variation in count affecting their distribution. This means we take some time at the start to learn a sensible sample count, i.e. one which matches the typical count we see per bucket. Therefore, we use the very beginning of the data set for this step rather than for mean analysis. This is true even if the data are polled, as in your case.
2. Sum treats empty buckets differently from mean: mean completely ignores empty buckets, whilst sum (approximately) treats them as zero. There is a function non_null_sum you can use if you want sum-style aggregation that ignores empty buckets.

In your case it doesn't sound like point 2 should apply, but point 1 definitely does apply and could cause small transient differences. These should diminish over time. Based on your description, this would be my best guess as to the cause.
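
To make the empty-bucket difference between sum and mean concrete, here is a toy sketch in Python. It is only an illustration of the bucketing arithmetic, not the actual ML implementation: the helper `bucket_features` and the sample data are hypothetical, and the sampling and learning-period behaviour of mean is not modelled at all. Sum is shown as exactly zero for an empty bucket, which simplifies the "approximately zero" treatment described above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def bucket_features(
    samples: List[Tuple[int, float]],
    bucket_span: int,
    start: int,
    end: int,
) -> List[Dict[str, Optional[float]]]:
    """Aggregate (timestamp_seconds, value) samples into fixed-width buckets.

    Hypothetical helper for illustration only. For each bucket it reports:
      mean         -> None when the bucket is empty (bucket is ignored)
      sum          -> 0.0 when the bucket is empty (simplified to exactly zero)
      non_null_sum -> None when the bucket is empty (bucket is ignored)
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        if start <= ts < end:
            buckets[(ts - start) // bucket_span].append(value)

    rows: List[Dict[str, Optional[float]]] = []
    for i in range((end - start) // bucket_span):
        values = buckets.get(i, [])
        total = float(sum(values))
        rows.append(
            {
                "bucket_start": float(start + i * bucket_span),
                "mean": total / len(values) if values else None,
                "sum": total,
                "non_null_sum": total if values else None,
            }
        )
    return rows


# One-minute data with a 60-second bucket span; the reading at t=120 is missing.
samples = [(0, 5.0), (60, 7.0), (180, 6.0)]
rows = bucket_features(samples, bucket_span=60, start=0, end=240)
for row in rows:
    print(row)
```

In this sketch, every bucket that holds exactly one reading yields the same feature value for mean and sum, which matches the expectation in the question. The two only diverge in the empty bucket, where sum contributes a zero while mean and non_null_sum contribute nothing.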

anomalyml · November 21, 2017, 5:29am

Hi,

Thanks for the clarification.

system · December 19, 2017, 5:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
