I hadn't looked at my machine learning jobs in a while. When I did look at them, they were all in a state of
JobState = closed, DatafeedState = started, and the latest timestamp was from a month ago.
How can I get these working again? I tried to issue a stop and got this:
Could not stop datafeed for http_requests_count [status_exception] Cannot stop datafeed [datafeed-http_requests_count] because the datafeed does not have an assigned node. Use force stop to stop the datafeed
I tried the force stop, and that fails with this:
{
  "error": {
    "root_cause": [
      {
        "type": "send_request_transport_exception",
        "reason": "[ldxx90elk16-isgeis][204.54.165.114:9300][cluster:admin/xpack/ml/datafeed/stop]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}
Any ideas on how to get these running again?
Thanks,
Tim
Something abnormal has happened because it shouldn't be possible to have a started datafeed corresponding to a closed job.
At the time you got that "null_pointer_exception" response there may well be a stack trace logged to the Elasticsearch log file. If there is, please paste it into a reply. It could be in the log of the node you submitted the force stop request to, or on 204.54.165.114 if that's a different node.
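Something like this might help locate the stack trace (a sketch only, assuming the default package-install log location; adjust the path and log file names for your setup):

```shell
# Search the Elasticsearch logs on each candidate node for the stack trace,
# with a couple of lines of context before and plenty after.
grep -B 2 -A 40 'NullPointerException' /var/log/elasticsearch/*.log
```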
Also, please let us know which version of Elasticsearch you're running.
I'm not sure how this datafeed has ended up in the state it's in. 204.54.165.114 is your master node, which rules out the theory that the node causing the null_pointer_exception is no longer part of the cluster.
Sending the force stop request to the master node directly (using curl or some other mechanism that will let you specify the exact node to send it to) might make a difference. The reason is that the null_pointer_exception is happening as the request is sent on from the node that received it to the master node. If it didn't have to be transported within the cluster it might not hit the bug it's currently hitting. (Obviously it shouldn't matter which node the request goes to, but the null_pointer_exception means there's a bug.)
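For example, something along these lines (a sketch, not a verified fix; note that the 9300 in the error message is the transport port, while curl talks to the HTTP port, which is 9200 by default; substitute your master node's address and the actual datafeed id):

```shell
# Send the force-stop request straight to the master node over HTTP, so the
# request does not have to be forwarded between nodes inside the cluster.
curl -XPOST 'http://204.54.165.114:9200/_xpack/ml/datafeeds/datafeed-http_requests_count/_stop' \
  -H 'Content-Type: application/json' -d'
{
  "force": true,
  "timeout": "30s"
}'
```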
If that doesn't work, the only workaround I can think of is to clone the jobs you want to restart and then start the cloned versions.
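Via the API, cloning would look roughly like this (a sketch only; the Kibana "Clone job" action does the equivalent. The `_clone` suffix, `analysis_config`, index name, and field names below are made-up placeholders; copy the real values returned in step 1):

```shell
# 1. Read the existing job's configuration (analysis_config, data_description, etc.)
curl -XGET 'localhost:9200/_xpack/ml/anomaly_detectors/http_requests_count?pretty'

# 2. Create a new job under a new id, reusing the configuration from step 1
#    (the body here is a hypothetical example, not your real config)
curl -XPUT 'localhost:9200/_xpack/ml/anomaly_detectors/http_requests_count_clone' \
  -H 'Content-Type: application/json' -d'
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}'

# 3. Create a datafeed for the new job, then open the job and start the datafeed
curl -XPUT 'localhost:9200/_xpack/ml/datafeeds/datafeed-http_requests_count_clone' \
  -H 'Content-Type: application/json' -d'
{
  "job_id": "http_requests_count_clone",
  "indices": [ "my-index" ]
}'
curl -XPOST 'localhost:9200/_xpack/ml/anomaly_detectors/http_requests_count_clone/_open'
curl -XPOST 'localhost:9200/_xpack/ml/datafeeds/datafeed-http_requests_count_clone/_start'
```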
I tried to do the force stop from the master.
ldxx90elk16:/root> curl -XPOST 'localhost:9200/_xpack/ml/datafeeds/amq_queues_stats/_stop' -H 'Content-Type: application/json' -d'
{
"force": true,
"timeout": "30s"
}'
{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "No datafeed with id [amq_queues_stats] exists"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "No datafeed with id [amq_queues_stats] exists"
  },
  "status": 404
}
So that's why: the datafeed didn't exist.
I cloned the job and was able to start up the clone, and it appears to be working.