I have unique data coming in once a day for a field and set up an advanced ML job with a bucket span of one day accordingly. The configured detector is a summation of a unique number field partitioned by another field.
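For readers following along, the job described above might look roughly like the sketch below. It builds the job body as a plain Python dict and does not call any cluster. The field names `amount`, `source_name` and `@timestamp` are assumptions made for illustration, not names from the original post.

```python
import json

# Sketch of an anomaly detection job body for the setup described above.
# "amount", "source_name" and "@timestamp" are hypothetical field names.
daily_sum_job = {
    "description": "Daily sum of a numeric field, partitioned by another field",
    "analysis_config": {
        # One bucket per day, matching data that arrives once a day.
        "bucket_span": "1d",
        "detectors": [
            {
                "function": "sum",
                "field_name": "amount",
                "partition_field_name": "source_name",
            }
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(daily_sum_job, indent=2))
```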
Alerts are also set up so that any anomalies above a threshold are emailed to me, but I would also like to know when data comes in late (compared to its historic timing).
In general, yes, because in each bucket_span the data only needs to be queried once and is then applied to both detectors. However, viewing and interpreting the results is easier in our UI (I find) if a job only has one detector.
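As a rough illustration of the single-job option, the sketch below puts two detectors in one job body. The field names `amount`, `source_name` and `@timestamp` are hypothetical. Note that `bucket_span` sits at the `analysis_config` level, so in this layout both detectors share the same span.

```python
import json

# Sketch: one job with two detectors. Field names are hypothetical.
combined_job = {
    "analysis_config": {
        # bucket_span is set once per job, so both detectors use it.
        "bucket_span": "1d",
        "detectors": [
            {
                "function": "sum",
                "field_name": "amount",
                "partition_field_name": "source_name",
            },
            {
                "function": "time_of_day",
                "partition_field_name": "source_name",
            },
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

detector_functions = [d["function"] for d in combined_job["analysis_config"]["detectors"]]
print(json.dumps(combined_job, indent=2))
```

Because the span is shared, a one-day span chosen for the sum detector also applies to the time_of_day detector in this layout.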
Hi richcollier, I have set up the ML job using the time_of_day function with a bucket span of 1 day. However, it does not behave exactly as I would like.
In your experience, is it possible to get real time alerts for late data with unique data coming in once a day?
Shorter bucket spans (for example, 10 minutes) are recommended when performing a time_of_day or time_of_week analysis. The time of the events being modeled is not affected by the bucket span, but a shorter bucket span enables quicker alerting on unusual events.
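A minimal sketch of a separate job along those lines, assuming a 10-minute bucket span and the hypothetical field names `source_name` and `@timestamp`, could look like this. It only builds the job body and counts how many buckets a day is split into.

```python
import json

BUCKET_SPAN_MINUTES = 10  # example value taken from the recommendation above

# Sketch: a dedicated job that models when events normally occur.
# "source_name" and "@timestamp" are hypothetical field names.
late_data_job = {
    "description": "Unusual time of day for daily data arrival",
    "analysis_config": {
        "bucket_span": f"{BUCKET_SPAN_MINUTES}m",
        "detectors": [
            {
                # time_of_day takes no field_name; it models event timing.
                "function": "time_of_day",
                "partition_field_name": "source_name",
            }
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

# With a 10-minute span, each day is evaluated in this many buckets
# instead of a single daily bucket.
buckets_per_day = (24 * 60) // BUCKET_SPAN_MINUTES

print(json.dumps(late_data_job, indent=2))
print(buckets_per_day)
```

Keeping this as its own job leaves the daily sum job on its one-day span while the timing job is evaluated every 10 minutes.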
The separate ML job with the time_of_day detector worked well for late data congestion!
My current configuration for the first ML job is a summation of a number field, partitioned by a field that is consumed once per day. Therefore, I set a bucket span of 1 day. However, with my alerts set up this way, I will only get one alert at the end of the day (after the ML job has run).
Is there any configuration that lets the ML job run in real time so that the alerts are real time as well (given the constraint of how my data comes in)?