ML jobs with missing documents

We have some jobs running with a high number of missing documents. According to the Elastic ML documentation, these checks are done after the buckets containing the missed documents have been processed and their anomaly scores finalized, and "if there is indeed missing data due to their ingest delay, the end user is notified". The question is: how can we make sure we don't miss any documents? Increasing query_delay usually works, but we also need to make sure the documents that were already missed get processed eventually. If we are notified soon enough, how does stopping and starting the datafeed over those time ranges affect the ML model and its results? Is there any risk of duplicate processing?

In general, you want query_delay set high enough to cover your ingest latency, so that late-arriving documents have already been indexed by the time the datafeed queries each bucket.
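For example, query_delay can be changed on an existing datafeed. A minimal sketch using the Python Elasticsearch client; the datafeed ID and the 5-minute delay are placeholder values you would adjust to your own ingest pipeline:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Raise query_delay so the datafeed only searches a bucket's time range
# after late-arriving documents have had time to be ingested.
# "my-datafeed" and "300s" are hypothetical values.
es.ml.update_datafeed(
    datafeed_id="my-datafeed",
    query_delay="300s",
)
```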

The ML job will not re-process past buckets unless you manually use the ML model snapshot APIs to revert the job to a snapshot that was saved before the data was missed. You can pass the delete_intervening_results flag to delete any anomaly results recorded since that snapshot, which also avoids duplicate results when those buckets are re-processed.
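A sketch of that revert step, again with the Python client. Note that the datafeed must be stopped and the job closed before a revert; the job ID and gap timestamp below are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The datafeed must be stopped and the job closed before reverting.
es.ml.stop_datafeed(datafeed_id="my-datafeed")
es.ml.close_job(job_id="my-job")

# Find the newest model snapshot taken before the data gap.
# "my-job" and the end timestamp are placeholder values.
snapshots = es.ml.get_model_snapshots(
    job_id="my-job",
    end="2024-01-01T00:00:00Z",  # hypothetical start of the gap
    sort="timestamp",
    desc=True,
    size=1,
)
snapshot_id = snapshots["model_snapshots"][0]["snapshot_id"]

# Revert to that snapshot; delete_intervening_results removes anomaly
# results recorded after it, so re-processing does not duplicate them.
es.ml.revert_model_snapshot(
    job_id="my-job",
    snapshot_id=snapshot_id,
    delete_intervening_results=True,
)
```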

After reverting, you can restart the datafeed from that point in time so the previously missed documents are processed going forward.
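Continuing the sketch above: reopen the job and start the datafeed from the reverted snapshot's timestamp (hypothetical here), so the datafeed re-queries everything from that point forward, including the documents that originally arrived late:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Reopen the job, then restart the datafeed from the snapshot time.
es.ml.open_job(job_id="my-job")
es.ml.start_datafeed(
    datafeed_id="my-datafeed",
    start="2024-01-01T00:00:00Z",  # hypothetical: the reverted snapshot time
)
```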


Many thanks for your quick response 🙂
