I went through the definitive guide published by O'Reilly and I couldn't solve my problem. I have also found similar questions about this, but I haven't quite understood how it can be done.
Let's assume I am running an instance of the ELK stack on a VM inside a node of my cluster (so in this case there is no distributed architecture). I was wondering whether it is possible to implement – let's say – the current state-of-the-art unsupervised real-time anomaly detection algorithm on time series (assuming I have a simple flow of log data). Would I then be able to visualise the time series and the outliers in Kibana in some way?
Now, suppose I run the ELK instance on a distributed architecture (assuming I have several Elasticsearch instances and many nodes) and do the same implementation as above. Will it run in a distributed way? Would it still be as resource- and time-efficient as the ML anomaly detection included in X-Pack?
If so, in at least one of these cases, could you point me to the right sources (books, blogs, etc.) to learn how to perform such a task?
Elastic ML is the state-of-the-art unsupervised real-time anomaly detection algorithm on time series.
In all seriousness - there are probably more than 100 "person-years" of research and development in the codebase that is Elastic ML. And it is not just the anomaly detection algorithms/techniques; it is also all of the other logistical details:
How to leverage both historical and real-time data
How to persist model state and "pick up where you left off" in the case of a node/cluster restart
How to deal with both raw and aggregated data
How to snapshot/restore those models to an earlier version in case there was a problem
How to filter out data you don't want analyzed that is mixed in with data that you do want analyzed
How to automatically split analysis across instances for parallel analysis
How to ignore results that have special meaning in one's domain
How to ignore time-frames that are known to be problematic
How to provide guardrails against excessive memory consumption that could cripple a node
How to deal with sparse data
How to manage the scheduling and throttling of querying for data to analyze and not overburden a cluster
How to format and publish results that are useful for UIs or API calls
How to clean up/maintain information that is no longer needed
...and many other details
These are all details that Elastic ML handles for you. I'm not saying you couldn't implement some of these things on your own, but you should know what you're getting into! To give a sense of how little of that plumbing is exposed to you, the sketch below shows what it takes to stand up a real-time anomaly detection job against the cluster.
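For illustration, here is a minimal sketch of driving Elastic ML through its REST API from Python. It assumes a local, unsecured single-node cluster at `localhost:9200` with an ML-capable (trial or Platinum) license, and a hypothetical `logs-*` index whose documents carry an `@timestamp` field; the job and datafeed names are made up for this example.

```python
# Minimal sketch: create, feed, and query an Elastic ML anomaly detection job.
# Assumes: local unsecured cluster, ML-capable license, a logs-* index with @timestamp.
import requests

ES = "http://localhost:9200"

# 1. Create an anomaly detection job: model the count of log events per 15-minute bucket.
job = {
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},
}
requests.put(f"{ES}/_ml/anomaly_detectors/log-event-rate", json=job).raise_for_status()

# 2. Attach a datafeed that pulls the job's input from the (hypothetical) logs-* indices.
datafeed = {
    "job_id": "log-event-rate",
    "indices": ["logs-*"],
    "query": {"match_all": {}},
}
requests.put(f"{ES}/_ml/datafeeds/datafeed-log-event-rate", json=datafeed).raise_for_status()

# 3. Open the job and start the datafeed; with no end time it keeps running in real time.
requests.post(f"{ES}/_ml/anomaly_detectors/log-event-rate/_open").raise_for_status()
requests.post(f"{ES}/_ml/datafeeds/datafeed-log-event-rate/_start").raise_for_status()

# 4. Later: pull record-level results, e.g. to chart outliers outside Kibana.
resp = requests.get(
    f"{ES}/_ml/anomaly_detectors/log-event-rate/results/records",
    json={"record_score": 75},  # only strong anomalies
)
for record in resp.json().get("records", []):
    print(record["timestamp"], record["record_score"])
```

Once the datafeed is running, the same job also appears in Kibana's Machine Learning app, where the Single Metric Viewer plots the time series with the detected anomalies overlaid - which covers the visualisation part of your question without any extra work.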