ML jobs are not distributed evenly across all ML nodes

Hello,

I am using the ML feature and have set 3 nodes as ML nodes. At first I created 3 ML jobs, and the cluster spread the 3 jobs across all 3 nodes. But after I restarted the cluster, all 3 jobs were allocated to only one node.

Is there a way to manually assign an ML job to a specific node?

Please advise. Thank you!

Hey @jasony

When jobs first start, they are assigned to the ML-enabled node with the least load, specifically the load related directly to ML (memory usage, number of jobs, etc.).

When you restarted your cluster, I am guessing that at a certain point the nodes running the ML jobs had stopped and only one ML node was available for the tasks to run on. Consequently, they all got reallocated to that one node.
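To make that concrete, here is a toy model of least-loaded assignment. It is only a sketch: it counts jobs per node and ignores memory and everything else the real allocator looks at, and the node and job names are made up for illustration.

```python
def assign_jobs(jobs, available_nodes):
    """Toy model: put each job on the available node that currently has the fewest jobs."""
    load = {node: 0 for node in available_nodes}
    placement = {}
    for job in jobs:
        # min() returns the first node with the lowest count, so ties go in list order
        target = min(load, key=load.get)
        placement[job] = target
        load[target] += 1
    return placement


jobs = ["job-1", "job-2", "job-3"]  # hypothetical job IDs

# Normal start: all three ML nodes are up, so the jobs spread out.
fresh = assign_jobs(jobs, ["ml-node-1", "ml-node-2", "ml-node-3"])

# During a restart: only one ML node is up when the jobs get reassigned.
after_restart = assign_jobs(jobs, ["ml-node-1"])

print(fresh)
print(after_restart)
```

With three nodes available each job lands on a different node; with a single node available, "least loaded" can only ever pick that node, and the jobs stay there after the other nodes come back.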

We don't have any formal mechanism for manually re-assigning jobs. The best solution would be to stop the jobs and start them again. They should pick up where they left off and get re-assigned to the currently least-loaded node (one of the nodes without ML jobs assigned).
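As a rough sketch of the stop/start sequence, the helper below only builds the list of REST calls for one job; it does not send anything. It assumes the `_ml/...` endpoint paths used by recent Elasticsearch versions (older X-Pack releases used an `_xpack/ml/...` prefix, so check the docs for your version). The function name and the job and datafeed IDs are hypothetical.

```python
def restart_plan(job_id, datafeed_id=None):
    """Return the ordered (method, path) REST calls to close and reopen one ML job."""
    job = f"/_ml/anomaly_detectors/{job_id}"
    steps = []
    if datafeed_id:
        # A running datafeed should be stopped before its job is closed.
        steps.append(("POST", f"/_ml/datafeeds/{datafeed_id}/_stop"))
    steps.append(("POST", f"{job}/_close"))
    # Reopening is the point at which the job is assigned to a node again.
    steps.append(("POST", f"{job}/_open"))
    if datafeed_id:
        steps.append(("POST", f"/_ml/datafeeds/{datafeed_id}/_start"))
    # The job stats can then be used to check where the job ended up.
    steps.append(("GET", f"{job}/_stats"))
    return steps


# Hypothetical IDs, for illustration only.
for method, path in restart_plan("job-1", "datafeed-job-1"):
    print(method, path)
```

Running the jobs through this one at a time, rather than reopening them all at once, gives each reopened job a chance to count toward a node's load before the next one is placed.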

Thank you for your advice. Based on it, I stopped and started the jobs and then saw that all jobs were re-assigned evenly.

Thank you!
