[6.1.1] Machine Learning Job + Watcher: Best config for hourly ingestion data

machine-learning

(Gustavo Matheus) #1

Hello!

I'm trying to figure out how to configure my Machine Learning Job and respective Watcher to run an hourly analysis + alarm over my Elasticsearch data. Here's my current situation:

  • My team is indexing data to Elasticsearch every hour. It starts at the beginning of the hour (xx:05) and takes a few minutes to complete (it ends about xx:17), that's why my Watchers runs on minute 30 of every hour (xx:30). The events timestamps are always aggregated by hour and recorded as the beginning of last hour (e.g.: 2018/01/24/14).
  • My ML Job Datafeed should run every hour to get new data from ingestion.
  • My Watcher should run on minute 30 of every hour to get anomalies from last hour ("timestamp": { "gte": "now-1h/h" }).

I tried some datafeed configurations for ML Jobs, but all of them ended with a "Datafeed has been retrieving no data for a while" message (and watcher also stopped triggering). What frequency and query_delay values should I set in order to get everything running synchronously?

On watcher, I'm running the last hour analysis ("timestamp": { "gte": "now-1h/h" }) and reporting anomalies with, at least, minor severity ("anomaly_score": { "gte": 25 } on bucket_results aggregation). After some tests, I've noticed that the query/aggregation generated by the ML page seems to fetch the "Overall" anomalies only, but not the term-split analysis. In the case below, for instance, we split our analysis using userEmail field and got a 27-point severity anomaly in one of these emails (yellow square), but the "Overall" analysis shows an "Max anomaly score: 1" message:
anomaly-example

I don't understand how the generated watcher fully works, but since it's a multi-metric Job split by userEmail field, shouldn't it be triggering those specific anomalies? How can I change the query/aggregation in order to fetch those values as well?

Thanks!


(David Kyle) #2

Hi Gustavo

The datafeed start time is aligned with its frequency so if the frequency is 1hour then the datafeed runs at minute 0 of every hour, if the frequency is 10 mins then the datafeed runs at 0, 10, 20, etc mins. For your case as you are ingesting data every hour the datafeed's frequency should be 1 hour. But your data is ingested at 17 mins past the hour, we can shift the datafeed's start time using the query_delay parameter. Setting query_delay to 20 mins causes the datafeed to run once every hour at 20 mins past.

These settings can be updated in the job config page, click the edit job button and go to the datafeed tab.

Regarding your Watcher question you have chosen a low severity threshold which can generate a lot of alerts I recommend using a higher threshold.

Watcher generates alerts on the bucket anomaly score, in your example the bucket anomaly score is low, it contains 1 yellow level anomaly for a single value of the userEmail field but that is not significant enough to raise the bucket score to yellow. In my experience it is better to alert on the bucket score rather than individual records as doing so will generate an abundance of alerts.


(Gustavo Matheus) #3

Hi David, Thanks for your help! I'll try the recommended job/datafeed settings!

About the watcher config, that's what I really need: alerting on individual records! If it generates too many alarms, I'll set a higher threshold (maybe major or critical severity). Is it possible to query these individual values instead of the "bucket" score?


(David Kyle) #4

Gustavo the Watcher config is plain JSON so you can change it to do anything you like, you'll find the config In the Watcher UI under the watch named ml-${your-job-name}

I think you will find studying the query used informative but the key line for you is:

 "condition": {
    "compare": {
       "ctx.payload.aggregations.bucket_results.doc_count": {
            "gt": 0
         }  

This says fire the watch action if the number of bucket results is > 0. Change bucket_results in the above to record_results.

Also you will need to adjust the threshold for which results are returned

        "record_results": {
          "filter": {
            "range": {
              "record_score": {
                "gte": 3
              }
            }

Set the record_score filter to your desired value.

I haven't tested this and you will have to manually make these changes for every ML generated watch so beware but hopefully this should be enough to get you going.


(Gustavo Matheus) #5

Hi David, thanks again for your help!

I've adjusted the watcher as you recommended and it worked! I also had to change some variables references inside email body in order to avoid execution errors (basically changing bucket_results references to record_results).

I also noticed there's no difference between influencer_results and record_results aggregations in my query results. This is probably because my only influencer is the same field configured in my partition_field_name, right?

Thanks!


(David Kyle) #6

You're welcome and good luck with the anomaly detection.

Yes I assume the aggregations look the same because they use the field.


(Gustavo Matheus) #7

David, another question:

Sometimes the hourly indexing job I mentioned in the first post takes longer than expected, and because of this, the ML Job running at every minute 20 of each hour is also fetching wrong data.

In some cases, we've noticed a false "Unexpected zero" anomaly, probably because of a delayed data indexing at this moment:
30

It would be great if we could configure the ML Job to start only after the data indexing is complete (and also configure the watcher to run when ML Job is done running). Is there any way to do that? Maybe through ML/Watcher APIs?

Thanks!


(David Kyle) #8

Hi Gustavo,

Yes from the screen shot it looks like your data is indexed after ML has analysed that time period. Would increasing the query delay not fix the problem?

In theory you could start the datafeed after your data has been indexed and run it for the last hour using the datafeed APIs but you would have to script this and trigger it after indexing. I wouldn't recommend this as the approach is likely to be error prone and you lose the benefit of the datafeed doing the work for you. Watcher doesn't have a manual trigger so you have to find a suitable schedule for it.

I personally would try to fix the indexing problem and ensure the data is indexed in a predictable manner rather than trying the above.


(Gustavo Matheus) #9

Hey David,

Increasing the query delay/watcher approach seems to fix the majority hourly ingestions, but we also plan to insert data each 15 minutes to another index, so we can't rely on ingestion time... also, if the datafeed API approach you mentioned works, we can deliver a more precise and earlier watcher analysis!

I manually tested some endpoints from both ML Job and watcher APIs:

ML Jobs:

  • Firstly, I created a ML Job through Kibana with datafeed configured with 1h frequency and no query delay (but I didn't start the datafeed itself).
  • After the hourly ingestion ends, I opened the job (/anomaly_detectors/<job_id>/_open) and started the datafeed /datafeeds/<feed_id>/_start, setting an end time to current hour.

Since my data's timestamp is aggregated by the beginning of the hour, the datafeed worked fine: it generated new analysis points for the last ingested hour in Anomaly Explorer. Running it for the first time took some instants longer than other hours, probably because it was running the analysis for the whole indexed period.

Watcher:

  • I created an advanced watcher with a cron config that will never be executed: "0 0 0 1 1 ? 1970" (I couldn't find a way to create the watcher without the trigger config).
  • Running /watch/<watch_id>/_execute for the case above worked fine!

The only problem in this solution is that the datafeed starting trigger returns an "started": true status only, and I need to know when the datafeed stopped in order to execute the associated watcher. Is it possible to get this information from the API request?

Also, please let me know if I'm going in the right direction with the approach above!

Thanks!


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.