From the screenshot, it looks like you have 3 ML nodes. In a multi-node scenario, if a node is restarted, we would expect the job to continue running on an alternate node, provided sufficient memory resources are available. If there were insufficient resources, we would expect to see a message in the Job Messages tab.
From your description, it sounds like you are not seeing the expected behavior.
First, please ensure that all nodes are running the same version of Elasticsearch and that your cluster health is green.
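If it's quicker than clicking through the UI, both can be checked with a couple of standard APIs. This is just a sketch: localhost:9200 is a placeholder for one of your nodes, and you may need to add credentials if security is enabled.

```
# List each node with its name, version, and roles - all versions should match
curl -s "localhost:9200/_cat/nodes?v&h=name,version,node.role"

# Overall cluster health - status should be "green"
curl -s "localhost:9200/_cluster/health?pretty"
```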
To troubleshoot this, please take one job and try the following: stop its datafeed, close the job, then reopen the job and restart the datafeed (a rough API sketch follows below). If the job resumes, this is hopefully an easier workaround than cloning each job. If the job does not resume, please check the Job Messages tab (available by expanding the row in the job list) for any error messages. If no errors are listed, please check the Elasticsearch log on the node shown in the Job Messages tab; we would expect more detailed logging there to explain why the job is not being allocated to another node.
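For reference, the same stop/close/reopen cycle can be driven through the ML APIs rather than the UI. A rough sketch, assuming a 7.x cluster (on 6.x the paths start with _xpack/ml instead of _ml), with my-job and my-datafeed as placeholder IDs:

```
# Stop the datafeed, then close the job
curl -s -X POST "localhost:9200/_ml/datafeeds/my-datafeed/_stop"
curl -s -X POST "localhost:9200/_ml/anomaly_detectors/my-job/_close"

# Reopen the job, then restart the datafeed
curl -s -X POST "localhost:9200/_ml/anomaly_detectors/my-job/_open"
curl -s -X POST "localhost:9200/_ml/datafeeds/my-datafeed/_start"
```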
I reviewed the job messages on the "Not Working" jobs, but I couldn't find any errors or unusual messages, such as insufficient resources on an alternate node or anything like that.
By the way, the job I wanted to close doesn't have much data fed into it, so I don't think closing should take this long. It has been processing for about 3 hours now.
What could I do next?
Should I wait for the job to close?
Sophie is taking a well-earned break at the moment. I will do my best to help you with your query. There are a few of us still checking this forum, and I have asked for suggestions. I have a few ideas myself. Bear with me while I gather my thoughts and I will have something for you soon!
Hi again Jason. While we are still working on putting together some viable solutions for you, would it be possible for you to supply us with log files, please?
Preferably just the log file from the ML node that was running the job, but it may be simpler for you to send the log files from every node in the cluster.
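If you're unsure which node the job is (or was) assigned to, the job stats API should show it: while the job is open, the node section of the response lists the assigned node. Again, my-job and localhost:9200 are placeholders:

```
# The "node" section of the response shows where the job is assigned
curl -s "localhost:9200/_ml/anomaly_detectors/my-job/_stats?pretty"
```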
I'll attempt to download the log files now. If I have any trouble doing so, may I contact you today? If not, it will just have to wait until next decade!
Happy New Year to you too! (It's already 2020 in my home country!)
I have looked at your logs and there are many errors and warnings in them, but it's hard to know which ones are relevant without knowing exactly which jobs are not working and exactly which node was restarted.
I can understand that you don't want to post these on a public forum. It will make it easier to have a discussion involving confidential information if you open a support case.
Also, it would be helpful to get a support diagnostic bundle for the affected cluster. If you open a support case, our support team can lead you through how to do this.
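For what it's worth, the bundle is normally generated with Elastic's support-diagnostics tool (https://github.com/elastic/support-diagnostics). The exact flags vary between releases, so treat this invocation as an assumption rather than a recipe:

```
# Run from the unpacked support-diagnostics release;
# host and port here are assumptions - point it at any node in your cluster
./diagnostics.sh --host localhost --port 9200
```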