[6.1.1] Machine Learning Job: reanalyse datapoints / ignore last bucket

Hi!

I'm trying to configure my ML Job + Watcher in a specific ingestion scenario:

  • Ingestion occurs once a day, and event timestamps are aggregated to the beginning of the day.
  • The daily ingestion updates the values for the day before yesterday (now-2d/d), but also inserts some documents into yesterday's bucket (now-1d/d), which will be replaced by the next ingestion (tomorrow).
  • The daily ingestion also updates older buckets (e.g. last week's buckets).

Rules:

  • It should ignore the last bucket (now-1d/d), since its value will be replaced tomorrow.
  • It should reanalyse updates to old buckets (e.g. last week's buckets).
  • The watcher should target the day before yesterday (now-2d/d) with the correct value analysis (I think this can be accomplished with a range filter using gte and lte in the watcher).
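
For reference, the range filter mentioned in the last rule could look roughly like the sketch below, written as a Python dict that can be serialized into a Watcher search input. It assumes the watch searches the ML bucket results (fields `job_id`, `result_type`, `timestamp`, `anomaly_score`); the job ID `daily-job` and the score threshold are hypothetical.

```python
import json


def day_before_yesterday_query(job_id, min_score=75.0):
    """Build a search body that matches only bucket results for now-2d/d."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "bucket"}},
                    {"range": {"anomaly_score": {"gte": min_score}}},
                    # gte rounds down to the start of the day and lte rounds
                    # up to its end, so both bounds together cover that day.
                    {"range": {"timestamp": {"gte": "now-2d/d", "lte": "now-2d/d"}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # "daily-job" is a hypothetical job ID used only for illustration.
    print(json.dumps(day_before_yesterday_query("daily-job"), indent=2))
```

Because the bounds exclude now-1d/d, the still-incomplete bucket for yesterday never reaches the watch condition.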

I tested several delay values for the datafeed, but it seems that the job always fetches data from the last bucket before it is updated to the correct value (that's why I need to ignore the last data point).

Is it possible to configure a job that follows the rules above?

Thanks!

For the most part, the "normal" operation of ML is to analyze data in chronological order without any attempt to "go back in time" and re-analyze older data. The setting of query_delay is meant to simply delay analysis to account for delays in ingest.
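
As a rough illustration of where query_delay lives, here is a minimal sketch that builds a datafeed configuration body in Python. The field names `job_id`, `indices`, `query` and `query_delay` come from the datafeed API; the job ID, index name and delay value are hypothetical, and any other fields your version requires are left out.

```python
import json


def datafeed_body(job_id, indices, query_delay="60s"):
    """Build a datafeed config whose searches lag real time by query_delay."""
    return {
        "job_id": job_id,
        "indices": list(indices),
        "query": {"match_all": {}},
        # Only postpones when data is searched; data that was already
        # analyzed is not revisited when it changes later.
        "query_delay": query_delay,
    }


if __name__ == "__main__":
    # Hypothetical names; "6h" delays each search by six hours.
    print(json.dumps(datafeed_body("daily-job", ["my-daily-index"], "6h"), indent=2))
```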

The only way to have an ML job re-analyze old buckets is to simply have "disposable" ML jobs that essentially analyze all of the historical data every single day.

This could be automated via daily scripts that use the ML API to create, run, and report on a job.

So, in effect:

  1. create the daily job
  2. run this job over all historical data that's relevant to you
  3. report/alert as necessary
  4. delete daily job
  5. start the entire process over tomorrow
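
The steps above can be sketched as an ordered list of REST calls. This is only a plan builder under stated assumptions: it uses the `/_xpack/ml` prefix of the 6.x ML API, the job ID, index name and detector field are hypothetical, and actually sending the requests, waiting for the datafeed to finish, and handling errors are left out.

```python
import json

ML = "/_xpack/ml"  # ML API prefix in 6.x; adjust for other versions


def daily_plan(job_id, job_config, datafeed_config, start, end):
    """Return the ordered (method, path, body) calls for one disposable job."""
    job = ML + "/anomaly_detectors/" + job_id
    feed = ML + "/datafeeds/datafeed-" + job_id
    return [
        ("PUT", job, job_config),                                  # 1. create the job
        ("PUT", feed, dict(datafeed_config, job_id=job_id)),       #    and its datafeed
        ("POST", job + "/_open", None),
        ("POST", feed + "/_start", {"start": start, "end": end}),  # 2. run over history
        ("GET", feed + "/_stats", None),   # poll here until the datafeed has stopped
        ("GET", job + "/results/buckets", None),                   # 3. report/alert
        ("DELETE", feed, None),                                    # 4. delete datafeed
        ("DELETE", job, None),                                     #    and job
    ]


if __name__ == "__main__":
    # Hypothetical configuration: daily buckets, mean of a field named "value".
    job_config = {
        "analysis_config": {
            "bucket_span": "1d",
            "detectors": [{"function": "mean", "field_name": "value"}],
        },
        "data_description": {"time_field": "@timestamp"},
    }
    datafeed_config = {"indices": ["my-daily-index"], "query": {"match_all": {}}}
    plan = daily_plan("daily-2018-01-15", job_config, datafeed_config,
                      "2017-01-01T00:00:00Z", "2018-01-14T00:00:00Z")
    for method, path, body in plan:
        print(method, path, json.dumps(body) if body is not None else "")
```

Putting the date in the job ID keeps each day's job distinct, and setting `end` to the start of yesterday leaves the incomplete bucket out of the analysis. Step 5 is then just a daily scheduler (cron, for example) calling the script again.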

Hey @richcollier, thanks for your help!

Besides the docs themselves, do you have a real example of ML API usage? It can be a blog post, a forum discussion, anything that could help me get started.

Thanks!

Take a look here for an example: https://gist.github.com/richcollier/5482702c7bef6de9a14ff29fa39ef21a