ML job state "failed"

I have a single-node cluster used for testing. When I restart it, all the ML jobs' status goes to "failed". I tried stopping the datafeeds before restarting, but that made no difference, and to get the jobs going again I had to clone them. With 2 or 3 jobs that isn't so bad, but as I add more it will become a serious issue. My prod cluster has 3 nodes, so I assume it would handle this better, but if I know I'm taking the cluster down, is there something I can do to stop and start the jobs more gracefully?

Before a cluster restart, you could:

  • Stop all running datafeeds
  • Close all open jobs
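For a couple of jobs you can do this by hand with the datafeed stop and job close APIs. A minimal sketch, assuming the same placeholder host/credentials as the script below and a hypothetical job named my_job that uses the default datafeed-<job_id> naming:

# Stop the datafeed first, then close the job
curl -u elastic:changeme -s -XPOST 1.2.3.4:9200/_xpack/ml/datafeeds/datafeed-my_job/_stop
curl -u elastic:changeme -s -XPOST 1.2.3.4:9200/_xpack/ml/anomaly_detectors/my_job/_close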

If there are a large number of jobs, you could script it - for example:

#!/bin/bash
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

echo
echo
# Grab every job_id from the anomaly detectors list
list=`curl $CURL_AUTH -s http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g'`
while read -r JOB_ID; do
   echo
   echo "Stopping ${JOB_ID}'s datafeed..."
   # Assumes the default datafeed naming convention of datafeed-<job_id>
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_stop
   echo "Closing ${JOB_ID}... (ignore 409 error if job was already closed)"
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_close
   
   echo
   echo
   echo "-------------"
   echo

done <<< "$list"
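After the restart, the same approach works in reverse to get everything running again. A sketch mirroring the script above, using the _open and _start endpoints; it assumes every job in the list should be running:

#!/bin/bash
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

# Same job_id extraction as the shutdown script above
list=`curl $CURL_AUTH -s http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g'`
while read -r JOB_ID; do
   echo "Opening ${JOB_ID}..."
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_open
   echo "Starting ${JOB_ID}'s datafeed..."
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_start
done <<< "$list"

Note the order is the reverse of shutdown: a datafeed can only be started once its job is open.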

@richcollier I thought closing a job was more of a finalising state, so I wasn't doing that first. I had stopped the datafeeds but not closed the jobs, fearing I wouldn't be able to re-open them. Bit green on all this ML stuff.

Thank you very much for your insights and the very useful script!
