Using data from Elasticsearch directly for machine learning


So far, I've been using the machine learning pack as follows:

  1. I store the data to be analysed in CSV format.
  2. I create the required mappings using the console in Kibana's 'Dev Tools' tab.
  3. I run a custom script that uploads the data from the CSV file into the Elasticsearch database.
  4. After a successful upload, I create a new index pattern and use it for the machine learning job analysis.
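For reference, the mapping creation in step 2 might look roughly like this in the Dev Tools console (the index and field names here are hypothetical placeholders for whatever metrics are being stored):

```
PUT /app-metrics
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "cpu_usage":  { "type": "double" },
      "memory":     { "type": "double" }
    }
  }
}
```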

Now I want to use the data in Elasticsearch for machine learning analysis continuously, in real time. For example, I have an application whose metrics are uploaded into Elasticsearch every minute. I want a continuous, real-time anomaly detection system that constantly monitors all the metrics uploaded into the database and provides analysis results based on them.

Could anyone explain how I can pull data directly from Elasticsearch and use it for real-time machine learning job analysis?

Note: I'm a complete beginner with Elasticsearch, so I would be glad to have a detailed explanation.



You can configure a datafeed to continuously extract data in real time. This is easily done in the UI when creating the job; alternatively, you can use the API.
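If you go the API route, the sketch below shows the general shape of it, assuming a recent Elasticsearch version where the ML endpoints live under `_ml` (the job name, index name, and detector field are hypothetical): create an anomaly detection job, attach a datafeed to it, and start the datafeed with no end time so it keeps running in real time.

```
PUT _ml/anomaly_detectors/app-metrics-job
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [ { "function": "mean", "field_name": "cpu_usage" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}

PUT _ml/datafeeds/datafeed-app-metrics
{
  "job_id": "app-metrics-job",
  "indices": [ "app-metrics" ]
}

POST _ml/datafeeds/datafeed-app-metrics/_start
{
  "start": "now"
}
```

Because no `end` time is given when starting the datafeed, it does not stop after catching up; it keeps polling the index for new documents.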


Thanks for the reply.

Could you kindly explain in detail how it can be achieved while creating a job in UI? I do not see the option to carry out the real-time operation.


Elasticsearch is not a database. It is an index, and because of that just about every aspect of working with the data is fundamentally different from non-indexed storage mediums. Elasticsearch has some truly wonderful and really cool ways of letting you extract information from your data, and it can do these fetches SUPER fast, but there is a cost that you must be aware of. It is fast because the data has been indexed, and what is stored beyond that is not a lot more than a very literal representation of how you asked it to index your documents and their metadata. Indexing is expensive, so it only happens when forced (not the best idea in most scenarios) or when the conditions (usually time) have been met for it to go ahead. While it is indexing, it works from a snapshot of what exists at that moment: nothing added or removed in the meantime is visible until the refresh completes, at which point you can finally see the data that was persisted.

There are a couple of features, like percolate queries and the one that looks for oddities within defined parameters, but even these have conditions that must be met. Analytical features like those require seed data; they need a good amount of historical data before they can even function.

Elasticsearch also likes approximations. It considers getting an answer to you quickly, and reasonably close to perfect, more important than getting you the exact answer. Consider a query whose results are to be plotted on a chart: the actual result set has some 9 million points, but there is only room for about 500 on the chart. From the perspective it is being viewed at, no matter how many more points are shoved into it, you could not see the remaining millions anyway. For reasons like this, and its favouring of speed over accuracy, Elasticsearch has gotten really good at giving you acceptably inaccurate (wrong) answers that are never an exact representation of the here and now, and at doing so at an alarmingly fast rate.

I know that I did not provide you with a means to solve your issue, but I hope that pointing out what you may consider a limitation is useful, depending on how sensitive to accuracy your use case is and how much time you have, even if it means you have to find another technology that may be better suited.

If you create a single- or multi-metric job, once the analysis has finished you will see an option in the bottom left-hand corner of the page to continue the job in real time.


This starts a real-time datafeed that will pull data from your index pattern. As Wes pointed out, you have to wait for your data to be indexed before it can be analysed; this is achieved by adding latency via the query_delay option on the datafeed. If you want to add a query delay to your job, use the advanced job configuration option on the create job page.
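As a rough sketch, the `query_delay` is part of the datafeed configuration itself (the job and index names below are placeholders; a delay of 120s gives the cluster time to index each minute's metrics before the datafeed queries for them):

```
PUT _ml/datafeeds/datafeed-app-metrics
{
  "job_id": "app-metrics-job",
  "indices": [ "app-metrics" ],
  "query_delay": "120s"
}
```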

Have you seen the Machine Learning Lab videos? You may find them a useful resource.


Thanks for the reply. As I said earlier, I currently feed data from a CSV file. My requirement is to fetch data directly into the machine learning analysis instead of going through the CSV.
I need to show a demo in which the data is used without constantly relying on the spreadsheet. Let's say I have a server with various metrics like CPU usage, memory, etc. I need to analyse this data to detect any anomalies. Is using the API the only option for this task?

To be clear, I'm currently at the first step, extracting data for the analysis; once I figure that out, I can move on to the real-time analysis task. I want to eliminate the CSV file step and make the data readily available to the machine learning job.


Just use Filebeat or Logstash to "tail" the CSV in real time and ingest it into Elasticsearch. Then the ML job can be configured to read data from that index for near-real-time analysis.
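As a minimal sketch, a Filebeat configuration that tails the CSV files might look something like this (the paths and hosts are hypothetical; in practice you would also need an ingest pipeline or a Logstash csv filter to parse each line's columns into fields):

```yaml
filebeat.inputs:
  - type: log                     # tail files line by line as new rows are appended
    paths:
      - /var/data/metrics/*.csv   # hypothetical location of the CSV files

output.elasticsearch:
  hosts: ["localhost:9200"]       # hypothetical Elasticsearch endpoint
```

Once the rows are flowing into an index, point the ML job's datafeed at that index rather than at anything CSV-related.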

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.