Machine learning jobs not reflecting new data

David_Tomasheski · March 30, 2018, 8:58pm

I have a few machine learning jobs, all running on the same index pattern. They are single metric doc count jobs. New data is coming in, but the count in the Single Metric Viewer flatlines to 0. I have no idea what I'm doing wrong. The jobs are not marked as not receiving new data; they just don't show it. Any advice would be greatly appreciated.

richcollier · March 30, 2018, 9:14pm

Hi David,

I assume that if you were to run the ML job over historical data, you don't get this problem (it's only with live data)?

If that's the case, the most likely culprit is that the ingest of your data is delayed. When ML "looks for it" (i.e. ML queries for the data in the last X minutes, where X is equal to the bucket_span), the data is not there yet, but it does show up at a later time.

You can check to see how many events the ML job is seeing by looking into the .ml-anomalies-* index for your job_id and for result_type:bucket - and inspect the event_count field. Here, you can see a consistent 870 events per bucket.
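As a rough sketch of that check, the Python below builds a search body that filters the results index on job_id and result_type:bucket and returns event_count per bucket. The job name my_doc_count_job is hypothetical, and the snippet only constructs the body; you would send it to .ml-anomalies-*/_search with whatever client you normally use.

```python
import json

def bucket_event_count_query(job_id, size=24):
    """Build a search body listing the most recent bucket results for one ML job."""
    return {
        "size": size,
        # Only the two fields needed to see how many events each bucket got.
        "_source": ["timestamp", "event_count"],
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "bucket"}},
                ]
            }
        },
    }

if __name__ == "__main__":
    # "my_doc_count_job" is a placeholder job_id, not one from this thread.
    print(json.dumps(bucket_event_count_query("my_doc_count_job"), indent=2))
```

If the most recent buckets show an event_count of 0 while older buckets look normal, that points to the data arriving after ML has already queried for it.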

So, if you're finding that only your "live data" is getting missed, you probably need to increase the query_delay parameter of the datafeed configuration, or figure out why things are not getting ingested/indexed as fast as they could be.
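For illustration, a minimal sketch of what raising query_delay could look like as a datafeed update request. The datafeed id datafeed-my_doc_count_job and the 5m value are placeholders, and the path uses the _xpack prefix from the 6.x documentation (newer releases drop it). The function only assembles the request; it does not send it.

```python
import json

def datafeed_update_request(datafeed_id, query_delay):
    """Return (method, path, body) for a datafeed update that sets query_delay."""
    # 6.x-style path; adjust for your Elasticsearch version.
    path = "_xpack/ml/datafeeds/{}/_update".format(datafeed_id)
    return "POST", path, {"query_delay": query_delay}

if __name__ == "__main__":
    # Placeholder id and value, chosen only to show the shape of the request.
    method, path, body = datafeed_update_request("datafeed-my_doc_count_job", "5m")
    print(method, path)
    print(json.dumps(body))
```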

David_Tomasheski · April 3, 2018, 5:04pm

This seems to be my problem. My data doesn't come in continuously. I get a burst of data starting at about 10 past the hour, and it lasts about 10 minutes. This data reflects the previous hour, so I'll get data for 9:00-10:10 in one big burst from about 10:10-10:20. Is there a good way to configure ML for this kind of setup? As this is external data, there is no way for me to change the delivery to a more constant method.

richcollier · April 3, 2018, 6:26pm

So one option would be to do a 1 hour bucket_span and a query_delay of 30m. That would mean that at 10:30, ML would ask for, and analyze, data from 9:00-10:00. Then at 11:30, ML would ask for, and analyze, data from 10:00-11:00, and so on.
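The timing in this one-hour example can be sketched as simple arithmetic. This is a simplified model, not the datafeed's actual scheduling code: it assumes data up to now minus query_delay is eligible and that buckets line up on boundaries counted from midnight. The function name analysis_window is made up for this sketch.

```python
from datetime import datetime, timedelta

def analysis_window(now, bucket_span, query_delay):
    """Return (start, end) of the latest complete bucket that is eligible at `now`.

    Simplified model: everything up to now - query_delay is eligible, and the
    end of the window is floored to a bucket boundary counted from midnight.
    """
    eligible_end = now - query_delay
    midnight = eligible_end.replace(hour=0, minute=0, second=0, microsecond=0)
    complete_buckets = (eligible_end - midnight) // bucket_span
    end = midnight + complete_buckets * bucket_span
    return end - bucket_span, end

if __name__ == "__main__":
    start, end = analysis_window(
        datetime(2018, 4, 3, 10, 30),
        bucket_span=timedelta(hours=1),
        query_delay=timedelta(minutes=30),
    )
    print(start.strftime("%H:%M"), end.strftime("%H:%M"))  # 09:00 10:00
```

Since the burst for the previous hour finishes around 10:20, a window that closes at 10:00 and is queried at 10:30 should see all of it.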

If you desire a smaller bucket_span for analysis reasons (let's say 10m), then you'd have to increase query_delay to something like 90m. In this case, at 10:40, ML would ask for, and analyze, data from 9:00-9:10, and then at 10:50, ML would ask for, and analyze, data from 9:10-9:20, and so on.

In both of the above cases, make the frequency parameter equal to the bucket_span. For data that is more real-time, often the frequency parameter is a fractional value of the bucket_span. (See https://www.elastic.co/guide/en/elasticsearch/reference/6.2/ml-put-datafeed.html for more info). But, in your case, it doesn't really make sense for frequency to be anything other than equal to the bucket_span.
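As a small sketch of the two options, the helper below pairs each query_delay with a frequency equal to the bucket_span. The helper name datafeed_timing is hypothetical. Note that bucket_span itself is set in the job's analysis_config, while query_delay and frequency belong to the datafeed.

```python
def datafeed_timing(bucket_span, query_delay):
    """Datafeed timing fields for bursty data: frequency mirrors the bucket_span."""
    return {"query_delay": query_delay, "frequency": bucket_span}

# The two configurations discussed above.
hourly = datafeed_timing(bucket_span="1h", query_delay="30m")
ten_minute = datafeed_timing(bucket_span="10m", query_delay="90m")

if __name__ == "__main__":
    print(hourly)
    print(ten_minute)
```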

Hope that helps

David_Tomasheski · April 4, 2018, 3:16pm

Thanks, this seems to have done the trick!
