ML datafeed with bursty data

machine-learning

(Matt) #1

Hi,

I have a feed of data coming into my ES cluster which can arrive in bursts. I initially set the feed to bring in the last 14 days' worth of data so I had some data to backtest. This can take hours as there are millions of records. After half of it had been ingested, I created an ML model and the lookback completed up to the current record at that time. Why do the subsequent records with later timestamps not get processed? The datafeed is set to real-time, and even stopping the datafeed and restarting it from the last timestamp in the index to real-time does not cause it to continue processing the remaining records up to the current point in time.

This leads me to another question. Due to the nature of the data we are ingesting, it can potentially arrive over the course of a day. Does the datafeed start from the latest timestamp in the index or the last time it ran? I am aware that I could set a query delay on the datafeed - I suppose for an entire day. But how would you recover if the source index got more than a day behind - would you need to recreate the entire job?

Many thanks,

Matt.


(rich collier) #2

Hi Matt,

Yes, if your data is that delayed, then you'll have to use a query_delay of something quite large (maybe an entire day?) so that you are not missing data.
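For illustration, a minimal sketch of where query_delay would go - the job and index names here are hypothetical, but job_id, indices, query and query_delay are standard fields of the datafeed configuration in the Elasticsearch ML API:

```python
import json

# Hedged sketch of a datafeed configuration with a large query_delay.
# "my-job" and "my-delayed-index" are made-up names for this example.
datafeed_config = {
    "job_id": "my-job",
    "indices": ["my-delayed-index"],
    "query": {"match_all": {}},
    # Delay every search by 24h so documents that arrive up to a day
    # late are already indexed and searchable when their bucket is
    # analyzed.
    "query_delay": "24h",
}

print(json.dumps(datafeed_config, indent=2))
```

The trade-off, of course, is that results always lag real time by the full delay.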

The ML job, when running in "real-time", will be constantly looking for data for each bucket_span (i.e. think of it as "every bucket span, analyze the last bucket span's worth of data"). If the ML job finds no data in the current bucket (because it hasn't been indexed and is not searchable yet), then the ML job just moves on to the next real-time bucket and tries that one. It never goes back to see if data has been backfilled.

Therefore, delaying the whole process using query_delay is the means by which you can run on an ongoing basis and not miss data that's highly delayed.
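A toy simulation of that bucket-stepping behaviour (my own illustration, not actual ML code): the datafeed cursor only steps forward one bucket_span at a time and analyzes whatever is searchable at that moment, so a document that is indexed late, with a timestamp in an already-processed bucket, is never analyzed.

```python
def run_datafeed(events, start, end, bucket_span):
    """Toy model of a real-time datafeed. `events` is a list of
    (indexed_at, timestamp) pairs; a document is only visible to a
    bucket if it was indexed by the time that bucket is processed."""
    analyzed = []
    cursor = start
    while cursor < end:
        bucket_end = cursor + bucket_span
        for indexed_at, ts in events:
            # Searchable by processing time, and inside this bucket.
            if indexed_at <= bucket_end and cursor <= ts < bucket_end:
                analyzed.append(ts)
        # Empty or not, the cursor moves on and never revisits a bucket.
        cursor = bucket_end
    return analyzed

# The doc with timestamp 5 is only indexed at time 25 (backfilled), but
# its bucket [0, 10) was processed at t=10, so it is never analyzed.
events = [(2, 1), (25, 5), (12, 11)]
print(run_datafeed(events, start=0, end=30, bucket_span=10))  # [1, 11]
```

Pushing the whole schedule back with a query_delay is equivalent, in this toy model, to processing each bucket a day after it closes, by which time the late documents are visible.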


(Matt) #3

Hi Rich,

Thank you for the response. That's a shame as it means our models will have to be a day behind.

When I stop the datafeed and restart it from the last timestamp of data processed to real time, it does not reprocess any data it has missed. Is this expected behaviour? I'd assumed that it would continue the lookback from that date.

Is the only option therefore to recreate the job?

Many thanks,

Matt.


(rich collier) #4

Yes, that is unfortunate unless you can get your data ingested in a more timely fashion.

The datafeed only goes in one direction - forward in time. So, if a datafeed is started and given a start time of X, it won't look at any timeframes before X to see if there happened to be any additional data that somehow got added since the last time it looked. Yes, your option would be to clone the job and re-run the historical data.
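A hedged sketch of that recovery path (the job and datafeed ids here are hypothetical): clone the job under a new id, then start the new datafeed with an explicit start far enough back to cover the backfilled range. `start` and `end` are the documented parameters of the ML start-datafeed API; passing only `start` runs the lookback over the historical data and then carries on in real time.

```python
import json

# Made-up id for the cloned job's datafeed.
datafeed_id = "datafeed-my-job-rerun"

# Request to kick off the re-run, e.g.:
#   POST _ml/datafeeds/datafeed-my-job-rerun/_start
# with `start` set at (or before) the earliest backfilled timestamp.
start_request = {
    "endpoint": f"POST _ml/datafeeds/{datafeed_id}/_start",
    "body": {"start": "2019-01-01T00:00:00Z"},
}

print(json.dumps(start_request, indent=2))
```

Since the datafeed only moves forward, this is effectively a fresh lookback over data that is now fully indexed, so nothing is skipped the second time around.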


(Mark Walkom) #5