I have a single-node cluster used for testing. When I restart it, all the ML jobs' status goes to "failed". I have tried stopping the datafeeds before restarting, but this made no difference, and to get the jobs going again I needed to clone them, which with 2 or 3 isn't so bad, but as I add more that will become a serious issue. My prod cluster has 3 nodes, so I assume it would handle this better, but if I know I'm taking the cluster down, is there something I can do to stop and start these jobs more gracefully?
Before a cluster restart, you could:

1. Stop every job's datafeed
2. Close every job
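For one or two jobs you can do that by hand; a minimal sketch, assuming the same host and credentials as the script below, and a hypothetical job named my_job with the default datafeed-my_job datafeed:

curl -u elastic:changeme -XPOST 'http://1.2.3.4:9200/_xpack/ml/datafeeds/datafeed-my_job/_stop'
curl -u elastic:changeme -XPOST 'http://1.2.3.4:9200/_xpack/ml/anomaly_detectors/my_job/_close'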
If there are a large number of jobs, you could script it - for example:
#!/bin/bash
# Stop all ML datafeeds and close all ML jobs before a cluster restart.
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

# Extract every job_id from the pretty-printed JSON job list.
list=$(curl $CURL_AUTH -s "http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty" | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g')

while read -r JOB_ID; do
  echo
  # Assumes the default datafeed naming convention, datafeed-<job_id>.
  echo "Stopping ${JOB_ID}'s datafeed..."
  curl $CURL_AUTH -s -XPOST "$HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_stop"
  echo "Closing ${JOB_ID}... (ignore 409 error if job was already closed)"
  curl $CURL_AUTH -s -XPOST "$HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_close"
  echo
  echo "-------------"
done <<< "$list"
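After the restart, the same approach in reverse brings everything back: open each job, then start its datafeed. A minimal sketch under the same assumptions (same host, credentials, and default datafeed naming), using the _open and _start endpoints:

#!/bin/bash
# Re-open all ML jobs and start their datafeeds after a cluster restart.
HOST='1.2.3.4'
PORT=9200
CURL_AUTH="-u elastic:changeme"

list=$(curl $CURL_AUTH -s "http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty" | awk -F" : " '/job_id/{print $2}' | sed 's/\",//g' | sed 's/\"//g')

while read -r JOB_ID; do
  echo "Opening ${JOB_ID}..."
  curl $CURL_AUTH -s -XPOST "$HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/_open"
  echo "Starting ${JOB_ID}'s datafeed..."
  curl $CURL_AUTH -s -XPOST "$HOST:$PORT/_xpack/ml/datafeeds/datafeed-${JOB_ID}/_start"
done <<< "$list"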
@richcollier I thought closing a job was more of a finalising step, so I wasn't doing that first. I'd stopped them but not closed them, fearing I wouldn't be able to re-open them. Bit green on all this ML stuff.
Thank you very much for your insights and the very useful script!