ML job state "failed"

I have a single-node cluster used for testing. When I restart it, all the ML jobs' status goes to "failed". I tried stopping the datafeeds before restarting, but that made no difference, and to get the jobs going again I had to clone them. With 2 or 3 jobs that isn't so bad, but as I add more it will become a serious issue. My prod cluster has 3 nodes, so I assume it would handle this better, but if I know I'm taking the cluster down, is there something I can do to stop and start the jobs more gracefully?

Before a cluster restart, you could:

  • Stop all running datafeeds
  • Close all open jobs
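For a couple of jobs you can do this by hand with the datafeed stop and job close APIs. A minimal sketch, assuming the same placeholder host/credentials as the script below and a hypothetical job named my_job that uses the default datafeed-<job_id> naming:

# Stop the datafeed first, then close the job
curl -u elastic:changeme -s -XPOST 1.2.3.4:9200/_xpack/ml/datafeeds/datafeed-my_job/_stop
curl -u elastic:changeme -s -XPOST 1.2.3.4:9200/_xpack/ml/anomaly_detectors/my_job/_close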

If there are a large number of jobs, you could script it - for example:

#!/bin/bash
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

echo
echo
# Grab every job_id from the anomaly detectors list
list=`curl $CURL_AUTH -s http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g'`
while read -r JOB_ID; do
   echo
   echo "Stopping ${JOB_ID}'s datafeed..."
   # Assumes the default datafeed naming convention of datafeed-<job_id>
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_stop
   echo "Closing ${JOB_ID}... (ignore 409 error if job was already closed)"
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_close
   
   echo
   echo
   echo "-------------"
   echo

done <<< "$list"
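After the restart, the same approach works in reverse to get everything running again. A sketch mirroring the script above, using the _open and _start endpoints; it assumes every job in the list should be running:

#!/bin/bash
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

# Same job_id extraction as the shutdown script above
list=`curl $CURL_AUTH -s http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g'`
while read -r JOB_ID; do
   echo "Opening ${JOB_ID}..."
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_open
   echo "Starting ${JOB_ID}'s datafeed..."
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_start
done <<< "$list"

Note the order is the reverse of shutdown: a datafeed can only be started once its job is open.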

@richcollier I thought closing a job was more of a finalising state, so I wasn't doing that first. I had stopped the datafeeds but not closed the jobs, fearing I wouldn't be able to re-open them. Bit green on all this ML stuff.

Thank you very much for your insights and the very useful script!
