Machine Learning datafeed started but not processing data

machine-learning

(Tim Arp) #1

Hi,

I hadn't looked at my machine learning jobs in a while. When I did look at them, they were all in a state of
JobState = closed, DatafeedState = started, and the latest timestamp was from a month ago.

How can I get these working again? I tried to issue a stop and got this:

Could not stop datafeed for http_requests_count [status_exception] Cannot stop datafeed [datafeed-http_requests_count] because the datafeed does not have an assigned node. Use force stop to stop the datafeed

I tried the force stop and that fails with this.
{
  "error": {
    "root_cause": [
      {
        "type": "send_request_transport_exception",
        "reason": "[ldxx90elk16-isgeis][204.54.165.114:9300][cluster:admin/xpack/ml/datafeed/stop]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}

Any ideas on how to get these running again?
Thanks,
Tim


(David Roberts) #2

Something abnormal has happened because it shouldn't be possible to have a started datafeed corresponding to a closed job.
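For anyone following along, the two states can be checked with the 5.x `_stats` endpoints. A rough sketch, assuming a node reachable on the default port 9200 and the job id from this thread:

```shell
# Check job state and datafeed state via the ES 5.x _xpack/ml _stats APIs.
# HOST and JOB are assumptions -- substitute your node address and job id.
HOST="localhost:9200"
JOB="http_requests_count"

JOB_STATS_URL="http://${HOST}/_xpack/ml/anomaly_detectors/${JOB}/_stats"
FEED_STATS_URL="http://${HOST}/_xpack/ml/datafeeds/datafeed-${JOB}/_stats"

# --max-time and the fallback keep this from hanging if no cluster is up.
curl --max-time 5 -s "$JOB_STATS_URL" || echo "no cluster reachable at $HOST"
curl --max-time 5 -s "$FEED_STATS_URL" || echo "no cluster reachable at $HOST"
```

The job stats response reports `state` (e.g. `closed`), and the datafeed stats response reports `state` (e.g. `started`) plus the assigned node, which is what's missing here.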

At the time you got that "null_pointer_exception" response there could well be a stack trace logged to the Elasticsearch log file. If there is, please paste it into a reply. It could be in the log of the node you submitted the force stop request to, or on 204.54.165.114 if that's a different node.

Also, please let us know which version of Elasticsearch you're running.
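(If you're not sure of the version, the root endpoint reports it; a quick sketch, assuming the default HTTP port 9200:)

```shell
# Print the cluster's root endpoint response, which includes
# "version": {"number": "..."}. Assumes a node on localhost:9200;
# falls back to a placeholder if nothing is reachable.
OUT=$(curl --max-time 5 -s 'http://localhost:9200' || echo '{"version":{"number":"unreachable"}}')
echo "$OUT"
```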


(Tim Arp) #3

The node where I ran the force stop doesn't have anything in the logs.
I'm running 5.5.1.


(David Roberts) #4

Please could you let us know a little more about your cluster:

  1. How many nodes does it have?
  2. Is 204.54.165.114 still in the cluster and healthy?
  3. Did you submit the force stop datafeed request to 204.54.165.114 or a different node?
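(The node list and the elected master are easy to pull with the `_cat/nodes` API; a sketch assuming the default port 9200:)

```shell
# List cluster nodes; with ?v, column headers are included, and a '*'
# in the master column marks the elected master node.
# Assumes a node on localhost:9200.
NODES_URL='http://localhost:9200/_cat/nodes?v'
curl --max-time 5 -s "$NODES_URL" || echo "no cluster reachable"
```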

(Tim Arp) #5

Hi,

Here's the list of servers (output of _cat/nodes):
192.43.1.201 56 99 9 0.25 0.62 0.66 d - ldxx90elk3-isgeis
204.54.165.185 55 91 8 2.16 2.03 1.96 d - ldxx90elk26-isgeis
204.54.165.165 49 98 4 0.55 0.87 0.94 d - ldxx90elk24-isgeis
192.43.1.202 43 93 9 1.25 1.33 1.50 d - ldxx90elk4-isgeis
192.43.1.198 40 94 1 0.10 0.03 0.01 m - ldxx90elk2-isgeis
204.54.165.138 70 99 13 0.57 0.75 0.87 d - ldxx90elk21-isgeis
204.54.165.163 69 97 7 1.53 1.59 1.66 d - ldxx90elk22-isgeis
204.54.165.113 69 91 0 0.03 0.04 0.05 m - ldxx90elk15-isgeis
192.43.1.211 66 98 8 0.49 0.76 0.88 d - ldxx90elk6-isgeis
204.54.165.114 43 95 31 0.51 0.60 0.58 m * ldxx90elk16-isgeis
172.16.151.240 45 90 2 0.15 0.10 0.06 - - ldxxpcelk2-isgeis
204.54.150.210 60 98 5 0.77 1.14 1.27 d - ldxx90elk14-isgeis
204.54.165.119 46 92 8 0.22 0.15 0.54 i - ldxx90elk18-isgeis
204.54.165.100 20 93 4 0.45 0.45 0.44 i - ldxx90elk12-isgeis
172.16.151.225 21 96 5 0.41 0.64 1.01 d - ldxx90elk25-isgeis
204.54.167.173 38 89 1 0.12 0.08 0.05 - - ldxxpcelk3-isgeis
192.43.1.11 38 99 8 2.06 1.43 1.13 d - ldxx90elk20-isgeis
172.16.151.239 52 91 2 0.01 0.06 0.05 - - ldxxpcelk1-isgeis
204.54.167.175 36 90 3 0.25 0.14 0.11 - - ldxxpcelk4-isgeis
204.54.165.164 46 90 7 1.89 1.26 1.08 d - ldxx90elk23-isgeis
192.43.1.203 71 98 6 1.15 1.11 0.99 d - ldxx90elk5-isgeis
204.54.165.118 45 93 7 0.07 0.17 0.23 i - ldxx90elk17-isgeis
204.54.165.99 23 93 6 0.14 0.14 0.20 i - ldxx90elk11-isgeis
204.54.163.182 37 98 7 0.40 0.51 0.57 d - ldxx90elk10-isgeis
204.54.163.155 37 98 4 0.01 0.09 0.19 d - ldxx90elk9-isgeis

I just ran the force stop from the Dev Console. I imagine it would be the same on any of the other nodes.

--Tim


(David Roberts) #6

I'm not sure how this datafeed has ended up in the state it's in. 204.54.165.114 is your master node, which rules out the theory that the null_pointer_exception is coming from a node that's no longer part of the cluster.

Sending the force stop request directly to the master node (using curl or some other mechanism that lets you choose exactly which node receives it) might make a difference. The reason is that the null_pointer_exception is happening as the request is forwarded from the node that received it to the master node; if it doesn't have to be transported within the cluster, it might not hit the bug. (Obviously it shouldn't matter which node the request goes to, but the null_pointer_exception means there's a bug.)
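Roughly, that direct request would look like this (a sketch: the master address is taken from your node list above, and the datafeed id from your first post — substitute your own):

```shell
# Force-stop the datafeed by sending the request straight to the master
# node (ES 5.x uses the _xpack/ml path). MASTER and DATAFEED are
# assumptions -- replace with your master's HTTP address and datafeed id.
MASTER="204.54.165.114:9200"
DATAFEED="datafeed-http_requests_count"
URL="http://${MASTER}/_xpack/ml/datafeeds/${DATAFEED}/_stop"

# --max-time and the fallback keep the example from hanging if the
# node is unreachable from where you run it.
curl --max-time 5 -XPOST "$URL" \
  -H 'Content-Type: application/json' \
  -d '{ "force": true, "timeout": "30s" }' \
  || echo "request failed (is the master reachable?)"
```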

If that doesn't work, the only workaround I can think of is to clone the jobs you want to restart and then start the cloned versions.


(Tim Arp) #7

I tried to do the force stop from the master.
ldxx90elk16:/root> curl -XPOST 'localhost:9200/_xpack/ml/datafeeds/amq_queues_stats/_stop' -H 'Content-Type: application/json' -d'
{
  "force": true,
  "timeout": "30s"
}'

{"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"No datafeed with id [amq_queues_stats] exists"}],"type":"resource_not_found_exception","reason":"No datafeed with id [amq_queues_stats] exists"},"status":404}

So that's why: the datafeed didn't exist.

I cloned the job, and was able to start up the clone and it appears to be working.

Thanks for your help,
Tim


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.