Datafeeds not working properly after ML node was accidentally restarted

Hi, this is Jason.

I'm using ELK 6.8 with ML.
Yesterday, one of our ELK nodes, which is dedicated to ML, was accidentally restarted, and since then some of our ML jobs have not been working.

I mean "Not Working" is Job is still "Open" and datafeed is "Started" visually but No data is feeded internally.

The jobs highlighted in yellow on the screenshot are the ones that are "not working".

The job within the red box was cloned for testing after the ML node was restarted, and it works well.

I can fix the stopped-datafeed jobs by cloning them, but I don't want to do that one by one, because it gets tedious when there are many jobs :frowning:

So I'd love to know how I can resume the datafeeds for these "not working" jobs without cloning them.

Need your help! :pray:

Hi Jason

From the screenshot, it looks like you have 3 ML nodes. In a multi-node scenario, if a node is restarted, we would expect the job to continue running on an alternate node, provided there are sufficient memory resources available. If there were insufficient resources, then we would expect to see a message in the Job Messages tab.
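
If it helps, the job stats and datafeed stats APIs show the state of each job, which node it is assigned to and, when it cannot be assigned, an explanation of why. This is only a rough sketch: `my_job` and `datafeed-my_job` are placeholder IDs, I'm assuming the cluster is reachable on localhost:9200, and in 6.8 these endpoints live under `_xpack/ml` (they move to `_ml` in 7.x):

```bash
# Sketch only: inspect where a job and its datafeed are running.
# "my_job" / "datafeed-my_job" are placeholders; adjust host/auth as needed.
curl -s "localhost:9200/_xpack/ml/anomaly_detectors/my_job/_stats?pretty"
curl -s "localhost:9200/_xpack/ml/datafeeds/datafeed-my_job/_stats?pretty"

# In the responses, look at "state", "node" (the assigned node) and
# "assignment_explanation" (populated when the job or datafeed cannot be assigned).
```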

From your description, it sounds like you are not seeing the expected behavior.

First, please ensure that all nodes are running the same version of Elasticsearch and that your cluster health is green.
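
A quick way to check both from the command line (a sketch, assuming the cluster is reachable on localhost:9200 without authentication; add credentials if security is enabled):

```bash
# Sketch only: confirm node versions match and the cluster is green.
curl -s "localhost:9200/_cat/nodes?v&h=name,version,node.role"
curl -s "localhost:9200/_cluster/health?pretty"
```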

In order to troubleshoot this issue, please take one job, stop its datafeed, close the job, and then restart the datafeed (either from the job list or via the ML APIs, as sketched below). If the job resumes, then this is hopefully an easier workaround than cloning each job. If the job does not resume, then please check the Job Messages tab (available by expanding the row in the job list) for any error messages. If there are no errors listed, then please check the Elasticsearch log on the node shown in the Job Messages tab. We would expect more detailed logging to explain why the job is not being allocated to another node.
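
If you prefer to use the APIs rather than the UI, the sequence would look roughly like this. Again, only a sketch: `my_job` and `datafeed-my_job` are placeholder IDs, I'm assuming the cluster is reachable on localhost:9200, and the 6.8 paths are under `_xpack/ml`:

```bash
# Sketch only: restart one job's datafeed via the ML APIs.
# "my_job" and "datafeed-my_job" are placeholders; adjust host/auth as needed.
curl -s -XPOST "localhost:9200/_xpack/ml/datafeeds/datafeed-my_job/_stop"
curl -s -XPOST "localhost:9200/_xpack/ml/anomaly_detectors/my_job/_close"

# When going through the API, the job must be open again before its datafeed
# can be started:
curl -s -XPOST "localhost:9200/_xpack/ml/anomaly_detectors/my_job/_open"
curl -s -XPOST "localhost:9200/_xpack/ml/datafeeds/datafeed-my_job/_start"
```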

Thanks
Sophie

Hi Sophie

Thanks for the reply.

All nodes in the cluster are on the same version, 6.8.0.
I tried to close one of my jobs, but it has been stuck closing for about three hours now.

I also reviewed the job messages for the "not working" jobs, but I couldn't find any errors or unusual messages, such as anything about insufficient resources on an alternate node.

By the way, the job I wanted to close hasn't had much data fed into it, so I wouldn't expect the closing procedure to take this long; it has been processing for about three hours now.
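
In case it's useful, the job state can also be checked outside the UI with something like this (a sketch; `my_job` stands in for the actual job ID, and I'm assuming direct access to the cluster on localhost:9200):

```bash
# Sketch only: check whether the job is still in the "closing" state.
# "my_job" is a placeholder for the real job ID; adjust host/auth as needed.
curl -s "localhost:9200/_xpack/ml/anomaly_detectors/my_job/_stats?pretty"

# The "state" field reports "closing" while the close is still in progress.
```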

What could I do next?
Should I wait for the job to close?

Regards,
Jason

Hi Jason,

Sophie is taking a well-earned break at the moment. I will do my best to help you with your query. There are a few of us still checking this forum, and I have asked for suggestions. I have a few ideas myself. Bear with me while I gather my thoughts and I will have something for you soon!

Ed

Hi again Jason. While we are still working on putting together some viable solutions for you, would it be possible for you to supply us with log files, please?

Preferably just the log file from the ML node running the job, but it may be simpler for you to send the log files from every node in the cluster.

Kind regards,

Ed.

Hi Ed,

Thanks for the reply. I'm having difficulty downloading log files from the nodes in the cluster because of security policies at work.

I'll ask my security team for permission, so please give me a few days.

Anyway, Happy New Year!

Regards,
Jason

Hi again Ed,

I got the logs and uploaded them to my Google Drive.
I've sent the link via PM.

Thanks for your support.
Happy New Year!

Thanks,
Jason

Thanks Jason.

I'll attempt to download the log files now. If I have any trouble doing so, may I contact you today? If not, it will just have to wait until next decade!

Happy New Year to you too! (It's already 2020 in my home country!)

Ed

Hi Jason,

Do you have a support contract?

I have looked at your logs and there are many errors and warnings in them, but it's hard to know which ones are relevant without knowing exactly which jobs are not working and exactly which node was restarted.

I can understand that you don't want to post these on a public forum. It will make it easier to have a discussion involving confidential information if you open a support case.

Also, it would be helpful to get a support diagnostic bundle for the affected cluster. If you open a support case then our support team can lead you through how to do this.
