Am I doing something wrong? Are my assumptions wrong?

Also... even if I did get this to work, it seems to me that Elasticsearch ML is not really machine learning at all, but statistical analysis (training) and threshold checking (detection).

What might I be missing, in execution and in theory, about what Elasticsearch ML is trying to do?

What bucket_span are you using for that analysis? What is the time granularity of the anomalies you want to detect?

In general, the time range of the data looks too short for the ML models to be able to make any meaningful detections. Could you try using data that covers a larger time range?

Here's another weird one: I would think there is clearly an anomaly, but no go. I did not see any errors in the job management viewer. The bucket span is 1 second, and it's 400,000 data points over 30 minutes.

The reason you do not see any anomalies is that the model does not consider itself reliable until it has seen enough data. The definition of "enough data" includes a requirement of having seen at least 2 hours of data.
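As a rough sanity check, you can verify whether a dataset even clears that 2-hour bar before running a job. This sketch only encodes the single figure mentioned above; the real readiness logic inside Elasticsearch ML is more involved:

```python
# Rough sanity check: does a dataset span the ~2-hour minimum the ML
# model wants before it trusts its own results? (Illustrative only --
# not Elastic's actual readiness logic.)

MIN_SECONDS = 2 * 60 * 60  # the 2-hour minimum mentioned above

def spans_enough_time(first_ts: float, last_ts: float) -> bool:
    """True if the data (timestamps in seconds) covers at least 2 hours."""
    return (last_ts - first_ts) >= MIN_SECONDS

# The 30-minute / 400,000-point recording from the post above:
# lots of points, but only 1800 seconds of coverage.
print(spans_enough_time(0, 30 * 60))       # -> False (30 minutes)
print(spans_enough_time(0, 3 * 60 * 60))   # -> True  (3 hours)
```

Note how this is purely about elapsed time, not point count, which is why 400,000 points in 30 minutes can still fall short.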

With regard to the cannot parse scroll id error, it would be really helpful if you could find and post the stack trace that should be in the logs on the node that ran that job.

The two-hour thing makes total sense: I guess if you assume a certain minimum data rate, that should give you enough total data points. On that note, I recently did a 30-minute recording with 400,000 data points, and I actually got a working anomaly analysis for some reason (see image). So maybe it's a "soft" 2-hour minimum(?)

I ended up just trying a bunch of things (single metric, multi-metric, advanced) until something worked. It looks like several jobs stopped before processing the full 400K data points due to that "scroll id" thing. I'll definitely append a log when I see it again.

"Enough data" is indeed an interesting metric, as anomaly detection typically depends on so many things. We have found in our work (outside of ES) that anomaly detection success in general depends on the number of data points as well as the type of algorithm (outlier analysis, isolation forests, unsupervised trees, deep auto-encoders, mixed clustering, etc.). I wonder what type of anomaly detection ES uses under the hood?

So, I think I will continue to try increasingly larger data sets as you suggest.

I am collecting more info with regard to your questions.

In the meantime, I have managed to reproduce the cannot parse scroll id error by having the datafeed include an index that has no mappings for the time field. It is definitely worth checking whether that is the case for you as well. If not, the logs will certainly contain useful information about this issue.

Update: I raised an issue and we'll work on a fix to stop masking that error. Thank you for the valuable feedback on that!

Just to give you some additional insight into what we are doing. In terms of how we are identifying anomalous time periods: we are creating a prediction for the new bucket value, up to some uncertainty. We take particular care to try and also model this uncertainty accurately. Anomalies correspond to values which are unlikely to be drawn from this predicted distribution. This prediction makes use of a number of features over different timescales, some of which we test for, some of which we include by a type of model averaging.
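That "predict a distribution, then flag values unlikely under it" scheme can be sketched minimally like this. This is not Elastic's actual model (theirs layers in multiple features and timescales, as described above); it just uses a running Gaussian estimate as the predicted distribution, with the uncertainty captured by the fitted variance:

```python
import math

def anomaly_scores(values, min_history=10):
    """For each value, score how unlikely it is under a Gaussian fitted
    to the values seen so far (a stand-in for the model's prediction
    plus its uncertainty). Returns two-sided tail probabilities; small
    probability means the value is anomalous."""
    scores = []
    history = []
    for v in values:
        if len(history) >= min_history:
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / (len(history) - 1)
            std = math.sqrt(var) or 1e-9  # guard against zero variance
            z = abs(v - mean) / std
            # two-sided tail probability of the predicted Gaussian
            scores.append(math.erfc(z / math.sqrt(2)))
        else:
            scores.append(None)  # not enough history to predict yet
        history.append(v)
    return scores

series = [10, 11, 9, 10, 10, 11, 9, 10, 11, 10, 10, 50]
print(anomaly_scores(series)[-1])  # tiny tail probability -> anomalous
```

The `None` scores for the first few buckets are the toy analogue of the warm-up period discussed earlier: until the predicted distribution is trustworthy, no anomalies are reported.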

In fact, at one point we based the rate at which we learn on the bucket count (whatever the bucket length). In practice this produced poor results at startup for short bucket lengths, because 99% of the signals we saw have features over longer time spans, which are important for making predictions about the next values. One way we now deal with this is to reduce the amount by which we narrow the prior distribution on various model parameters if the bucketing interval is short.

In terms of sources of delay: we delay model selection at startup to deal with the standard issues one hits with BFs for non-informative priors. Also, for aggregate metrics whose distribution depends on the sample count, i.e. things like mean, min, and max, we arrange to sample the data in (as close as possible) fixed measurement counts per sample. We therefore take some time to work out the typical rate of records per bucket interval, so that our chosen sample count is close to the mean rate of messages per bucket. This process can be delayed if we see significant variation in the data rates. We roll some of this up into a blanket 2-hour minimum for individual analysis, independent of bucket length; it can be longer. For population analysis, when the detector configuration includes an "over" field, if we see many individuals per bucket interval we are getting information at a faster rate and can therefore generate results sooner.
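The sample-count idea above can be sketched roughly as follows. The helper and the 0.5 stability threshold are hypothetical, not Elastic's code: estimate the typical records-per-bucket rate, and only commit to a sample count near that mean once the rate looks stable:

```python
from statistics import mean, pstdev

def choose_sample_count(records_per_bucket):
    """Pick a per-sample measurement count close to the mean records
    per bucket, but only once the rate looks stable enough -- a rough
    stand-in for the warm-up behaviour described above. The 0.5
    coefficient-of-variation cutoff is an arbitrary illustration."""
    m = mean(records_per_bucket)
    s = pstdev(records_per_bucket)
    # high variation in the data rate delays the decision
    if m == 0 or s / m > 0.5:
        return None  # keep waiting for a stable rate estimate
    return max(1, round(m))

print(choose_sample_count([200, 210, 195, 205, 198]))  # -> 202 (stable)
print(choose_sample_count([10, 400, 3, 250, 20]))      # -> None (unstable)
```

This mirrors why the warm-up "can be delayed if we see significant variation in the data rates": a noisy rate estimate means any fixed sample count would be a poor fit.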

Obviously, this is the minimum time to generate anomalies. We can't learn long-timescale effects, like weekly periodicity, a slow trend, etc., this quickly. In this context we try to be conservative, i.e. we arrange that failing to capture some important effect early in the model's lifetime produces a blind spot rather than false positives.

Finally, we do have feature requests around modelling data with very short time scale features, particularly things with sub-second duration, so this is an area we are considering enhancing.

By BF, do you mean "beta function" priors? So your approach to anomaly detection is based on estimating a Bayesian posterior?

Also, given #1, it sounds like each time bucket is an event in the "binomial probability" and each bucket is "conditionally independent" of all the others?

Also, given #2, it sounds like the buckets are de-correlated in the sense that all buckets contribute in a way that is not related to where they fall relative to each other in time.

Thanks again. I'm really excited about ML in Elasticsearch!

BTW, the Elastic sales rep has convinced me to do a blog post on my experience here with Kibana 6.0. I hope to share that with you shortly.

Also, we have been investigating convolutional deep net approaches to time series prediction (and anomaly detection), and we would love to share our results with you if you are interested.

No problem at all and thank you for your feedback!

Regarding your questions:

By BF I meant Bayes factor: in order to handle different data characteristics automatically we have a number of different families of model we can use for the noise-like component of the time series, and we use a Bayesian approach to combine these.
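The Bayesian model-combination idea can be sketched like this. This is not Elastic's implementation: it compares just two candidate noise families (Gaussian vs. Laplace) and uses plug-in fitted-parameter likelihoods as a crude stand-in for the marginal likelihoods a true Bayes factor requires:

```python
import math
from statistics import mean, stdev

def log_lik_gaussian(data, mu, sigma):
    """Log likelihood of the data under a Gaussian noise model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

def log_lik_laplace(data, mu, b):
    """Log likelihood of the data under a Laplace noise model."""
    return sum(-math.log(2 * b) - abs(x - mu) / b for x in data)

def model_weights(data):
    """Posterior weights for the two candidate noise models under equal
    prior odds, so the log Bayes factor is the log-likelihood difference.
    (Plug-in likelihoods, not integrated ones -- a simplification.)"""
    mu, sigma = mean(data), stdev(data)
    b = sum(abs(x - mu) for x in data) / len(data)  # Laplace scale MLE
    lg = log_lik_gaussian(data, mu, sigma)
    ll = log_lik_laplace(data, mu, b)
    m = max(lg, ll)  # subtract max for numerical stability
    wg, wl = math.exp(lg - m), math.exp(ll - m)
    total = wg + wl
    return wg / total, wl / total

# Heavy-tailed data shifts weight toward the heavier-tailed Laplace model.
data = [0.1, -0.2, 0.05, 0.0, 8.0, -7.5, 0.15, -0.1, 6.9, -0.05]
w_gauss, w_laplace = model_weights(data)
print(w_laplace > w_gauss)  # -> True
```

A prediction combined with these weights (model averaging) adapts automatically as the data reveals which noise family fits better, which is the behaviour described above.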

(And 3) We test for and model correlation between bucket values at different time scales: periodic components of different lengths, trends over different timescales, etc. As you inferred, we use these to de-correlate the bucket values so that we can treat all buckets on an equal footing in step 1 (albeit with special handling for values which appear highly unusual w.r.t. our current model). The thing one immediately thinks of in this context is the expected value of the distribution displaying correlation, but I found it particularly important to allow multiple statistical characteristics of the data to display temporal correlation; for example, the variation in the data, the variation in the log of the data, etc.
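A toy version of that de-correlation step, assuming a single known periodic component (the real system tests for periods and trends rather than being told them): subtract the per-phase mean, so the residual buckets can be scored on an equal footing and only genuinely unusual values stand out:

```python
from statistics import mean, stdev

def decorrelate_periodic(values, period):
    """Remove a periodic component of the given length by subtracting
    the per-phase mean, so all buckets can be scored on an equal
    footing. A toy stand-in for the de-correlation described above."""
    phase_means = [mean(values[p::period]) for p in range(period)]
    return [v - phase_means[i % period] for i, v in enumerate(values)]

# Strong period-4 pattern, plus one genuinely unusual bucket (index 9).
raw = [10, 50, 30, 20, 10, 50, 30, 20, 10, 90, 30, 20]
resid = decorrelate_periodic(raw, period=4)

# Scoring the raw values would flag every "50" bucket as high; scoring
# the residuals singles out the real anomaly instead.
mu = mean(resid)
top = max(range(len(resid)), key=lambda i: abs(resid[i] - mu))
print(top)  # -> 9
```

Without the de-correlation, the periodic peaks dominate any equal-footing score; with it, the buckets really can be treated as exchangeable in step 1.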

Thanks for the offer, I'd be very interested to hear about your work and results using CNNs for time series prediction and anomaly detection.
