ML node restart, job recovery

Hi

If all available ML nodes are unexpectedly restarted and all running jobs go into an open/stopped state, what would be the recommended way to resume all jobs? I've been able to start the datafeed manually for each stopped job through the Kibana dashboard, but I'm wondering if there is anything else that should be done to make sure the jobs are in a consistent state? If several jobs are impacted, is there a way to safely resume them all (a bulk restart)?

Thanks

ML nodes are unexpectedly restarted and all running jobs go into an open/stopped state

This isn't supposed to happen. If an ML node leaves the cluster then the jobs that were running on it should stay open and their datafeeds should still be started. Then they should get relocated to another ML node that has space, or wait until an ML node has space (which might be the restarted node when it rejoins the cluster).
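
If you want to double-check what state things actually ended up in after the restart, the job stats and datafeed stats APIs report the current state and node assignment. A minimal sketch using curl, assuming a local cluster address and basic-auth credentials (both are placeholders to substitute with your own):

```bash
# Placeholders: replace with your own cluster address and credentials.
ES_URL="https://localhost:9200"
AUTH="elastic:changeme"

# Job state ("opened", "closed", "failed", ...) and the node each job is assigned to.
curl -s -u "${AUTH}" "${ES_URL}/_ml/anomaly_detectors/_stats?pretty"

# Datafeed state ("started" or "stopped") and node assignment for each datafeed.
curl -s -u "${AUTH}" "${ES_URL}/_ml/datafeeds/_stats?pretty"
```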

Which version are you running? And are you using Elastic Cloud or your own install? We have had bugs related to this in the past and might be able to tell you which one you're seeing and which version it's fixed in.

You are correct that if a job/datafeed has incorrectly gone to the open/stopped state then the first thing to try is restarting the datafeeds. Unfortunately we don't have a bulk restart for datafeeds. The start datafeed endpoint only works on one datafeed at a time. Maybe as a workaround you could write a bash script that loops over all the affected datafeed IDs and uses curl to call start datafeed for each one.
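
Something along these lines, as a sketch (the cluster address, credentials, and datafeed IDs are placeholders; you can list the affected datafeeds and their states with GET _ml/datafeeds/_stats):

```bash
#!/usr/bin/env bash
# Sketch of a bulk restart: call the start datafeed endpoint for each affected datafeed.
# ES_URL, AUTH, and the datafeed IDs are placeholders -- substitute your own values.
ES_URL="https://localhost:9200"
AUTH="elastic:changeme"

DATAFEED_IDS=(
  "datafeed-job-1"
  "datafeed-job-2"
  "datafeed-job-3"
)

for id in "${DATAFEED_IDS[@]}"; do
  echo "Starting datafeed: ${id}"
  curl -s -u "${AUTH}" -X POST "${ES_URL}/_ml/datafeeds/${id}/_start"
  echo
done
```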

For the future, if you know in advance you are going to restart every ML node you could try setting ML upgrade mode. This will stop all ML processing but keep the job/datafeed states as they were. Then, after your maintenance is complete, unset ML upgrade mode using the same endpoint. This will result in less churn in the cluster during the maintenance as the ML jobs won't try to relocate to other nodes while ML is in upgrade mode. It's a bit like disabling shard reallocation before restarting all the data nodes. ML is supposed to work through node restarts even if you don't set upgrade mode, but setting it prevents unnecessary churn and may avoid bugs like the one you ran into.
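
For reference, upgrade mode is toggled with the set upgrade mode endpoint; a sketch of the two calls around the maintenance window (cluster address and credentials are placeholders as above):

```bash
# Placeholders: replace with your own cluster address and credentials.
ES_URL="https://localhost:9200"
AUTH="elastic:changeme"

# Before restarting the ML nodes: pause all ML processing, keeping job/datafeed states as they are.
curl -s -u "${AUTH}" -X POST "${ES_URL}/_ml/set_upgrade_mode?enabled=true&timeout=10m"

# ... perform the node restarts / maintenance ...

# After the maintenance: turn upgrade mode off again so jobs and datafeeds resume.
curl -s -u "${AUTH}" -X POST "${ES_URL}/_ml/set_upgrade_mode?enabled=false&timeout=10m"
```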


Thanks David for your response. You are right, it looks like the jobs were impacted by another issue, which is resolved now. I will look into the upgrade mode feature and also the script solution :)
Thanks again for your suggestions.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.