Monitoring Machine Learning Job States

Hi,

I wonder what would be the best approach to get alerts when a job fails or goes into an unexpected state. I have some failed jobs, and some jobs showing as closed with a started datafeed (so moving from open/started to closed/started). Is there a way to get alerts on those job states? The notifications index might help, but it is still very noisy because of manual closing/starting of jobs.

Thanks,
Sara

Hi @saraKM,

At the moment you can utilize Watcher to get notified about the job and datafeed states.

Use a search input that queries the .ml-notifications-* index for the messages you are interested in. Example input:

{
  "input": {
    "search": {
      "request": {
        "indices": [
          ".ml-notifications-*"
        ],
        "body": {
          "sort": {
            "timestamp": {
              "order": "desc"
            }
          },
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-15m"
                    }
                  }
                },
                {
                  "terms": {
                    "message.raw": [
                      "Datafeed stopped"
                    ]
                  }
                },
                {
                  "term": {
                    "job_id": {
                      "value": "my_job"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}

To avoid extra noise consider action throttling.
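
As a rough sketch of how throttling could fit in, here is a complete watch body that reuses the search above and sets a per-action throttle_period. The 15m schedule, the 1h throttle, the action name log_datafeed_stopped, and the logging action are assumptions chosen for illustration, so swap in whatever interval and action type (email, webhook, etc.) fit your setup. It would be sent as the body of a PUT _watcher/watch/<watch_id> request.

```json
{
  "trigger": {
    "schedule": { "interval": "15m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": [".ml-notifications-*"],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "range": { "timestamp": { "gte": "now-15m" } } },
                { "terms": { "message.raw": ["Datafeed stopped"] } },
                { "term": { "job_id": { "value": "my_job" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "log_datafeed_stopped": {
      "throttle_period": "1h",
      "logging": {
        "text": "Datafeed stopped for job my_job ({{ctx.payload.hits.total}} notification(s) in the last 15 minutes)"
      }
    }
  }
}
```

With this setup the watch still runs every 15 minutes, but once the action has fired it should not fire again for an hour, which cuts down on repeated alerts for the same stop event. Throttling alone will not tell a manual stop apart from an unexpected one, so the message filter may still need tuning.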

FYI, Kibana Alerting provides a more convenient way of managing your alerting rules and actions. It is under active development, and new alert types are added with each release. Feel free to create a GitHub issue in the Kibana repo for the alert type you're interested in.

Hope it helps.

Regards,
Dima