ML job state "failed"

I have a single node cluster used for testing which when I restart it all the ML jobs "job status" go to failed. I have tried to stop the datafeed before restarting it but this made no difference and to get the job going again I needed to clone them which with 2 or 3 isn't so bad but as I add more that will be a serious issue. My prod cluster has 3 nodes, I assume this would handle it better but if I know I'm taking the cluster down is there something I can do to allow these to stop and start in a more graceful way?

Before a cluster restart, you could:

  • Stop all running datafeeds
  • Close all open jobs

If there are a large number of jobs, you could script it - for example:

#!/bin/bash
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

echo
echo
list=`curl $CURL_AUTH -s http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g'` 
while read -r JOB_ID; do
   echo
   echo "Stoping  ${JOB_ID}'s datafeed..."
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_stop
   echo "Closing ${JOB_ID}... (ignore 409 error if job was already closed)" 
   curl $CURL_AUTH -s -XPOST $HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_close
   
   echo
   echo
   echo "-------------"
   echo

done <<< "$list"

@richcollier I thought closing a job was more of a finalising state so wasn't doing that first, I'd stopped them just not closed then fearing I'd not be able to re-open them. Bit green on all this ML stuff.

Thank you very much for your insights and the very useful script!