I have a classic pattern during the day and a noisy pattern (with peaks and null values) during the night because there is less traffic at night than during the day.
We can also see the learned pattern. It is too global: even though my pattern is regular every day, the learned pattern isn't really close to reality because of the night.
So, I have some questions:
Is it possible to analyse only some hours of the day?
Is it possible to improve the model if I want to keep all my data?
The screenshot you provided shows the type of diurnal periodicity that ML is designed to model and learn. We do learn the peaks and drops in the daily values, as well as different patterns that happen across days of the week and days of the weekend. The night times can be modelled.
The screenshot also shows that the model does not fit well. I'd suggest a couple of things here...
What does it look like over a longer time period?
In general, we need to see at least 2 days, and at least 2 weekends, before we can start to model the daily pattern. In your example, the daily pattern is not yet seen, and I suspect this is due to the very large spike. I cannot tell from the granularity of the chart whether this occurs at exactly the same time every day. Regardless, this spike is likely to mean that ML requires a longer learning period, perhaps 3-4 weeks.
What job config do you have?
It sounds like mean(responsetime), but at what bucket_span? It may be that running the same job with a variety of different bucket spans yields better results. I would suggest trying 15m and 60m and looking to see if there is a difference, along with a longer learning period.
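To make the bucket span comparison concrete, here is a minimal sketch that builds two otherwise identical job bodies, one per bucket span. The job IDs and the `@timestamp` time field are assumptions for illustration; only `mean(responsetime)` and the 15m/60m spans come from this thread.

```python
import json


def make_job(bucket_span):
    """Build one anomaly detection job body for the given bucket span."""
    return {
        "description": f"mean(responsetime), bucket_span={bucket_span}",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {"function": "mean", "field_name": "responsetime"}
            ],
        },
        # Assumption: the time field is called @timestamp in your index.
        "data_description": {"time_field": "@timestamp"},
    }


# Hypothetical job IDs, one job per bucket span to compare side by side.
jobs = {f"responsetime_mean_{span}": make_job(span) for span in ("15m", "60m")}

for job_id, body in jobs.items():
    print(job_id)
    print(json.dumps(body, indent=2))
```

Each printed body could be pasted into the Advanced Job Config, so the two jobs can be run over the same data and their results compared.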
It is possible to only model certain times of day. You could do this by using a query filter in the Advanced Job Config. It would also be possible to create a Saved Search, which only returns certain hours of the day. I would not propose this as the best way to improve your ML results however. I'd suggest that the longer learning period would be the first thing to try.
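For completeness, a query filter of that kind might look roughly like the sketch below. It is only an illustration under several assumptions: the time field is called `@timestamp`, the hours 8 to 20 are arbitrary, the script uses the Painless date accessor as I remember it from 5.x (`.date.hourOfDay`, evaluated in UTC), and the script body key is `inline` (renamed to `source` in later versions). Please check these against your own version before relying on it.

```python
import json


def hours_of_day_query(start_hour, end_hour, time_field="@timestamp"):
    """Build a bool query that keeps documents with start_hour <= hour < end_hour (UTC)."""
    if not 0 <= start_hour < end_hour <= 24:
        raise ValueError("expected 0 <= start_hour < end_hour <= 24")
    # Assumption: 5.x-style Painless date access on a date field.
    source = (
        f"def h = doc['{time_field}'].date.hourOfDay; "
        "return h >= params.start && h < params.end;"
    )
    return {
        "bool": {
            "filter": [
                {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "inline": source,  # "source" in later versions
                            "params": {"start": start_hour, "end": end_hour},
                        }
                    }
                }
            ]
        }
    }


query = hours_of_day_query(8, 20)
print(json.dumps(query, indent=2))
```

The printed query is what would go in the query part of the job's data feed, or in a Saved Search.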
It seems to be better with a bigger bucket span (I tried 30m and 1h):
I still have several questions about that:
I have the feeling that the peaks in the mornings, at night and at weekends keep the baseline from getting closer to my metric: during the day, my response time is between 200 ms and 300 ms at most, but after two and a half months the upper bound is still around 900 ms, as you can see in the previous screenshot.
My response time and several other metrics (such as the number of requests) can't be negative, but the baseline is frequently below zero. Can we specify that the metric can't be negative?
In another case, with a sum of logs (number of requests), I have a problem with bank holidays. Here is what I can see:
In fact, after one or two weeks, I have a really good baseline, as we can see. But on July 14th, the French National Day, I had no traffic. After this day, we can see that the lower bound goes negative and stays below zero for a very long time (there is another bank holiday in August). So, is there a specific treatment for these days?
My last question is about the same example as the previous question. I can see:
I don't understand why there are anomalies during the afternoon of July 14th while my metric is between the two bounds. I also don't understand why the baseline changes so quickly: after just one hour under the lower bound, the lower bound drops quickly and doesn't stay at the same level as on other days. So, on this day, I had no traffic, but I didn't get any really important alarm (just minor level).
Spiky data can be challenging to model, it's true. I am glad that you are seeing better results after a longer learning time, and I hope we can help you get an even better model fit.
What version of ML are you using? We have had some recent improvements to periodicity detection, so it would be good to know if you are using the latest.
When using mean or sum, it is not possible to specify that the value is always positive... we have not provided the configuration option, because it is not often true that the value is always positive (except in your case of course).
One caveat is when you are analysing a count of something, but the data is aggregated, and the ML function chosen is a sum. This sounds like your example for sum(number of requests). This is really a count. Using the Adv Config, it is possible to create a job that uses summary_count_field: number_of_request, and then the detector can use a count function. This will be aware that the value cannot be negative. I suggest trying this, as it is preferable to use a count function for counts.
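As a rough sketch of that count-based job, the body below uses the pre-summarised count field with a `count` detector. As far as I know the property is spelled `summary_count_field_name` in the job JSON; the 1h bucket span and the `@timestamp` time field are assumptions for illustration.

```python
import json

# Sketch of a job that treats pre-aggregated data as a count.
count_job = {
    "description": "count of requests from pre-summarised data",
    "analysis_config": {
        "bucket_span": "1h",  # assumption: pick the span that worked best for you
        # Field that already holds the number of requests per input record.
        "summary_count_field_name": "number_of_request",
        # count takes no field_name; it uses the summary count field above.
        "detectors": [{"function": "count"}],
    },
    # Assumption: the time field is called @timestamp in your index.
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(count_job, indent=2))
```

Because the detector is a count rather than a sum, the model can take into account that the value never goes below zero.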
The bounds plotted are a simplified 2D representation of a complex model. In some instances you may see an anomaly lying within the bounds. This can happen in cases where modes have been detected in the data. The plotted bounds (the shaded area) will span multiple modes. (It is also useful to check that the chart aggregation interval is the same as the bucket span, i.e. zoom in.)
The anomalies on July 14th occur approx 16 days after the model has begun to settle. This may still be a case of too little time to learn, although I would be interested to see the results from modelling the data as a count function.
Your dataset is very interesting. If you are able to share it with us, we could help you further with configuration and provide more detailed explanations, and it may help us improve the modelling and configuration experience.
I am using version 5.5.0. Maybe it would be better with 5.6.0?
I tried your suggestion (summary_count_field: number_of_request and detector: count). In this case, I can't see the shaded area. I can only see my metric and the anomalies.
As you can see, there are fewer anomalies. The red and yellow anomalies are the same as with the other function. But the bank holidays generated more alarms with the sum function than with the count function.
With the sum function:
With the count function:
I will try the new version and send you a part of my dataset.
Just so you know, when you create an advanced job, the "shaded area" (the representation of the model of expected range of your metric) is not turned on by default like it is in the single metric job.
You can, however, turn it back on by manually editing the job JSON. Copy and paste the following into the configuration JSON, alongside the analysis_config section:
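A minimal sketch of turning the model plot on, assuming the setting is the `model_plot_config` job property with `enabled` set to true. As far as I know it sits at the top level of the job JSON, next to `analysis_config`, so check the placement against your version. The starting job body is a hypothetical example.

```python
import json


def enable_model_plot(job):
    """Return a copy of the job body with the model plot (shaded bounds) enabled."""
    updated = dict(job)
    # Assumption: model_plot_config is a top-level sibling of analysis_config.
    updated["model_plot_config"] = {"enabled": True}
    return updated


# Hypothetical advanced job body, without model plot.
job = {
    "analysis_config": {
        "bucket_span": "1h",
        "summary_count_field_name": "number_of_request",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},
}

job_with_plot = enable_model_plot(job)
print(json.dumps(job_with_plot, indent=2))
```

The part to paste is the `"model_plot_config": {"enabled": true}` entry shown in the printed JSON.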