How to configure ANOMALY DETECTION with DAILY buckets

Hi,

I am having a hard time figuring out how to set up the datafeed of my Anomaly Detection job.
I have a Transform job that aggregates all of the daily data to the same day at 00:00.
I want to run an Anomaly Detection job on this index, with a daily bucket (bucket_span: 1d).
If I understand correctly, the query delay should be 24h (query_delay: 24h), since I want to wait until the end of the day for all of that day's data to be aggregated at 00:00.
The frequency is set to 24h as well, since I only have one timestamp per day (frequency: 24h).
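
To make it concrete, here is a simplified sketch of what I have (job id, index, and field names are placeholders):

    # Hypothetical job: one daily bucket over the transformed index
    PUT _ml/anomaly_detectors/daily-agg-job
    {
      "analysis_config": {
        "bucket_span": "1d",
        "detectors": [
          { "function": "mean", "field_name": "my_metric" }
        ]
      },
      "data_description": { "time_field": "@timestamp" }
    }

    # Datafeed with the settings described above
    PUT _ml/datafeeds/datafeed-daily-agg-job
    {
      "job_id": "daily-agg-job",
      "indices": ["daily-transform-index"],
      "query_delay": "24h",
      "frequency": "24h"
    }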

However, it doesn't work. Let's say I have data for 2020-01-29, all aggregated at 00:00.
I would expect the ML job to process this document at 2020-01-30 00:00, but it does not.
I have to wait until 2020-01-31 00:00 for the data of January 29th to be processed.

Any idea?
P.S.: All dates and times are expressed in UTC.

Hi,

Interesting combination of transforms and anomaly detection! I think when configuring the query_delay you also have to take the additional delay of the transform into account. As I understand it, your transform runs every 24 hours, but its results will not be available immediately: the transform needs time to process all the data and persist the results. It seems that the transform and anomaly detection currently run at the same time, whereas anomaly detection should run after the transform. The query_delay should therefore include a little extra.

Do you know how long it takes to transform the data of the last 24 hours? You should be able to find out by checking the audit information (job messages) in the UI; an audit message is sent after each checkpoint has finished. If it took e.g. 5 minutes, you should add those 5 minutes to the 24 hours, plus a little extra.
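
If you prefer the API over the UI, a quick way to gauge this is the transform stats endpoint (transform id is a placeholder; I'm assuming a recent 7.x release where the endpoint is named _transform):

    # Checkpoint timing is reported in the "stats" section of the response,
    # e.g. exponential_avg_checkpoint_duration_ms
    GET _transform/daily-agg-transform/_stats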

Regarding the anomaly detection configuration: your frequency setting would still only trigger anomaly detection every 24 hours, and it would probably run too early. I suggest lowering it, e.g. to 1h, so you should get results at 01:00. The unnecessary triggers until the next full bucket shouldn't be a big problem.
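
As a sketch, assuming your datafeed is called datafeed-daily-agg-job and the transform usually finishes within a few minutes (note that durations take a single unit, so 24h30m would be written as 1470m):

    # Stop the datafeed first if it is running, then:
    POST _ml/datafeeds/datafeed-daily-agg-job/_update
    {
      "query_delay": "1470m",
      "frequency": "1h"
    }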

We will discuss this issue in the engineering team to see how we can improve this.


FWIW, could you describe your use case in more detail? What does the transform do, and what anomalies are you looking for? Datafeeds can run aggregations on their own (see the sketch below), so I wonder whether the transform is really needed. Additionally, a bucket span of 24 hours is unusual for an anomaly detection use case.
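
For illustration, a rough sketch of an alternative datafeed that aggregates the raw index directly, skipping the transform (index and field names are made up; an aggregated datafeed needs the max aggregation on the time field, and the job needs summary_count_field_name set, typically to doc_count):

    PUT _ml/datafeeds/datafeed-daily-agg-job
    {
      "job_id": "daily-agg-job",
      "indices": ["raw-events"],
      "aggregations": {
        "buckets": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1d" },
          "aggregations": {
            "@timestamp": { "max": { "field": "@timestamp" } },
            "field_1_sum": { "sum": { "field": "field_1" } }
          }
        }
      }
    }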

Hey @Hendrik_Muhs,

  • I set up a transform job with a date histogram set to 1d. All the data coming in during a given day is aggregated to the same day at midnight (a sketch of this config follows this list).
    e.g.: Let's say I have data coming in at timestamp 2020-02-03 08:00:00 and another record coming in at timestamp 2020-02-03 23:59:59; both of them are aggregated at timestamp 2020-02-03 00:00:00 (the Transform frequency is 60s).
  • Therefore, I want to wait 24h before the Anomaly Job picks up the data.
    e.g.: The job running at timestamp 2020-02-04 00:00:00 should pick up the records at timestamp 2020-02-03 00:00:00.
    With the frequency set to 24h and the query_delay set to 24h, the Anomaly Job at timestamp 2020-02-04 00:00:00 does not pick up the records at 2020-02-03 00:00:00; I have to wait until timestamp 2020-02-05 00:00:00 for them to be picked up.
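
For reference, here is roughly what my transform config looks like (index names are placeholders; the two sum fields relate to my question below):

    PUT _transform/daily-agg-transform
    {
      "source": { "index": "raw-events" },
      "dest": { "index": "daily-transform-index" },
      "frequency": "60s",
      "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
      "pivot": {
        "group_by": {
          "@timestamp": {
            "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" }
          }
        },
        "aggregations": {
          "field_1_sum": { "sum": { "field": "field_1" } },
          "field_2_sum": { "sum": { "field": "field_2" } }
        }
      }
    }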

Regarding the aggregation in the Transform job, I need to divide the SUM of two fields. Is that possible in the datafeed directly?
Let's say I have two records like this:
{_id: 1, field_1: 3, field_2: 5} and {_id: 2, field_1: 7, field_2: 1}
I need to compute a field like so: (3+7) / (5+1)

Yes, you can accomplish this via a script_field. See this example: ML Job on Scripted field
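
For instance, assuming the transform writes the two sums out as field_1_sum and field_2_sum (names borrowed from the sketch above), the datafeed could compute the ratio roughly like this:

    PUT _ml/datafeeds/datafeed-daily-agg-job
    {
      "job_id": "daily-agg-job",
      "indices": ["daily-transform-index"],
      "query_delay": "1470m",
      "frequency": "1h",
      "script_fields": {
        "ratio": {
          "script": {
            "lang": "painless",
            "source": "doc['field_2_sum'].value != 0 ? doc['field_1_sum'].value / doc['field_2_sum'].value : 0"
          }
        }
      }
    }

The detector would then reference ratio as its field_name.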

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.