ML nodes are unexpectedly restarted and all running jobs go into an open/stopped state
This isn't supposed to happen. If an ML node leaves the cluster, the jobs that were running on it should stay open and their datafeeds should remain started. The jobs should then relocate to another ML node that has capacity, or wait until an ML node has capacity (which might be the restarted node when it rejoins the cluster).
Which version are you running? And are you using Elastic Cloud or your own install? We have had bugs related to this in the past and might be able to tell you which one you're seeing and which version it's fixed in.
You are correct that if a job/datafeed pair has incorrectly ended up in the open/stopped state then the first thing to try is restarting the datafeeds. Unfortunately there is no bulk restart for datafeeds: the start datafeed endpoint only works on one datafeed at a time. As a workaround you could write a bash script that loops over all the affected datafeed IDs and calls the start datafeed endpoint with curl for each one.
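A minimal sketch of such a script, assuming an unauthenticated cluster on localhost:9200 (adjust the URL and add credentials to match your setup); the function name is just for illustration:

```shell
#!/bin/bash
# Hypothetical helper: start each datafeed ID passed as an argument
# via the ML start datafeed endpoint (POST _ml/datafeeds/<id>/_start).
ES="${ES:-http://localhost:9200}"

start_datafeeds() {
  local id
  for id in "$@"; do
    echo "Starting datafeed: $id"
    # A datafeed that is already started returns an error here,
    # which is harmless for this purpose.
    curl -s -X POST "$ES/_ml/datafeeds/$id/_start"
    echo
  done
}

# Example usage:
#   start_datafeeds datafeed-1 datafeed-2 datafeed-3
```

You could also pull the list of IDs from `GET _ml/datafeeds` and filter it down to the affected ones rather than typing them out by hand.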
For the future, if you know in advance that you are going to restart every ML node, you could try setting ML upgrade mode. This stops all ML processing but preserves the job/datafeed states as they were. Then, after your maintenance is complete, unset upgrade mode using the same endpoint. This means less churn in the cluster during the maintenance, as the ML jobs won't try to relocate to other nodes while upgrade mode is set. It's similar to disabling shard allocation before restarting all the data nodes. ML is designed to work through node restarts even if you don't set upgrade mode, but setting it prevents unnecessary churn and may avoid bugs like the one you ran into.
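For reference, toggling upgrade mode is a single call to the set upgrade mode endpoint; a sketch, again assuming an unauthenticated cluster on localhost:9200 (the wrapper function is just for illustration):

```shell
#!/bin/bash
# Hypothetical wrapper around POST _ml/set_upgrade_mode.
# Pass "true" before the maintenance and "false" after it.
ES="${ES:-http://localhost:9200}"

set_ml_upgrade_mode() {
  curl -s -X POST "$ES/_ml/set_upgrade_mode?enabled=$1&timeout=10m"
}

# Before restarting the ML nodes:
#   set_ml_upgrade_mode true
# After the maintenance is complete:
#   set_ml_upgrade_mode false
```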