Hi, I have an ML job for disk usage as a percentage (0 to 100). I ran a forecast on Monday and it gave me predictions above 100% usage (193%, 232%, etc.).
Yesterday I cloned the job and added a custom rule to skip results AND model updates when the actual value is greater than 100. Now the forecast is different and doesn't go above 100%, and the model's confidence bounds are also thinner than before (more confident).
I was wondering whether the custom rule was responsible for this change in the forecast, or whether it was just the time at which I ran the forecast.
Skipping model updates will affect all aspects of modelling: anomaly detection, forecasting, etc. Skipping results only affects anomaly detection. However, note that if this condition is affecting the results, then you have a data issue: the condition actual > 100 only applies if the bucket actual values, i.e. the raw inputs to the model such as the mean of disk usage percentages in the time bucket, are greater than 100. This could happen for certain features, say if your detector was using sum(percentage), since that is scaled by the document count; but if not, I would check that the inputs really are percentages. Of course, this might be an acceptable workaround for input data quality issues, but generally it would be better to debug where the values greater than 100 are coming from.
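For reference, a rule like the one described, skipping both results and model updates when the actual value exceeds 100, would look roughly like this in the detector configuration. This is a sketch: the field name is a placeholder, and `skip_result`/`skip_model_update` are the rule actions in the Elasticsearch ML detector API.

```json
{
  "function": "high_mean",
  "field_name": "disk.usage.pct",
  "custom_rules": [
    {
      "actions": ["skip_result", "skip_model_update"],
      "conditions": [
        {
          "applies_to": "actual",
          "operator": "gt",
          "value": 100
        }
      ]
    }
  ]
}
```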
Hi @Tom_Veasey, thanks for your answer. The detector I'm using is "high_mean". I have about two months of data and don't have any value greater than 100 in the index.
There are definitely reasons for issue 1: forecasting doesn't know that there is a constraint on disk usage, so if there is an upward trend it will happily extrapolate that trend to values greater than 100. We have thought about doing some additional work on forecasting; user-defined variable constraints may be part of this.
Issue 2 is very strange. The mean of a collection of values will definitely be less than 100 if the raw values are all less than 100. Equally, we are just taking a mean of values, so there is not really scope for this code to contain bugs, and the rule condition just reads this value. So, all in all, I think it is more likely that somehow bad values are getting passed to the model. The mechanism by which this is happening I'm not sure, but I can give you some suggestions for how I would debug it.

I would first check the model actuals: if you enable model plot, your results index should contain documents with the actual value for every bucket. Without applying the filter, I would check whether this index contains any docs with values greater than 100. If it doesn't, it suggests the issue is with the rule condition; if it does, you need to understand where these values are coming from (I don't have any ideas for this).

I would also apply the same filter condition (to exclude docs with values greater than 100) in the job datafeed and confirm whether or not it affects results as well. Based on the search result you posted it shouldn't, but it is still worth checking the data fed into the model.
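As a sketch of that first check, assuming model plot is enabled, a search along these lines against the ML results index should surface any buckets whose actual value exceeded 100 (the job ID here is a placeholder):

```json
GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "disk-usage-job" } },
        { "term": { "result_type": "model_plot" } },
        { "range": { "actual": { "gt": 100 } } }
      ]
    }
  }
}
```

If this returns no hits while the rule is still visibly changing behaviour, that points back at the rule condition rather than the input data.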