Unable to open Machine Learning job

machine-learning

(Alon Goldstein) #1

Hi,

I'm using Elasticsearch and Kibana version 6.2.4 with Platinum license.
Every time I create a Machine Learning job (through API or Kibana), I'm getting the following error in ES log:

[2018-07-31T08:36:30,172][WARN ][r.suppressed ] path: /_xpack/ml/anomaly_detectors/aa/_open, params: {job_id=aa}
org.elasticsearch.transport.RemoteTransportException: [zlt23646.vci.att.com][135.68.47.160:9300][cluster:admin/xpack/ml/job/open]
Caused by: org.elasticsearch.ElasticsearchException: Unexpected job state [failed] while waiting for job to be opened
at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:43) ~[?:?]
at org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:351) ~[?:?]
at org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:326) ~[?:?]
at org.elasticsearch.xpack.core.persistent.PersistentTasksService.lambda$waitForPersistentTaskStatus$4(PersistentTasksService.java:157) ~[?:?]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:186) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateListeners$7(ClusterApplierService.java:509) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) ~[?:1.8.0_91]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_91]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_91]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:506) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:489) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

What am I doing wrong?

Thanks,
Alon


(David Kyle) #2

Hi Alon,

Can you tell me a little more about your setup, please? Which OS are you using, and have you upgraded Elasticsearch recently? Does this error occur for all jobs you create or just certain ones, and can you share the job configuration?

Since you have a Platinum license, have you raised the issue with support?


(David Roberts) #3

The trick to solving this will be to look in the log file of the node where the ML job attempted to start. That will contain the underlying error message. How many ML nodes are in your cluster? By that I mean nodes with node.ml: true in elasticsearch.yml, or with no mention of node.ml at all (since true is the default). If it’s just a few, maybe you could check their logs around the time of the error.
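
To illustrate the "which nodes are ML nodes" question, here is a minimal Python sketch that picks the ML-enabled nodes out of a nodes info response. It assumes the response from GET _nodes reports an "ml.enabled" node attribute, which I believe is how 6.x exposes it, but please verify against your own cluster's output. The sample data and the node names are hypothetical.

```python
def ml_nodes(nodes_info):
    """Return the sorted names of nodes whose attributes mark them as ML-enabled.

    Assumption: each node entry carries attributes["ml.enabled"] == "true"
    when ML is enabled on it. Check your own GET _nodes output.
    """
    names = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        attrs = node.get("attributes", {})
        if attrs.get("ml.enabled") == "true":
            # Fall back to the node id if the entry has no name.
            names.append(node.get("name", node_id))
    return sorted(names)


# Hypothetical, heavily trimmed response for a two node cluster.
sample = {
    "nodes": {
        "id1": {"name": "node-a", "attributes": {"ml.enabled": "true"}},
        "id2": {"name": "node-b", "attributes": {"ml.enabled": "true"}},
    }
}

print(ml_nodes(sample))
```

Every node this lists is a candidate for running the job, so each of their log files is worth checking around the time of the failure.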


(Alon Goldstein) #4

I am using a two node cluster, running on Linux machines.
No upgrade with ML jobs was done.
This error occurs for every job I'm trying to create, no matter what the data or job configuration is.

I should probably mention that I changed the configuration for the Elasticsearch temp folder, since the ML job error previously indicated that it couldn't find its folder under /tmp on the machine.


(Alon Goldstein) #5

What I shared in the original message is the exact error I'm getting in the log files.
There are no other underlying error messages, unfortunately...

There is no mention of node.ml in elasticsearch.yml.
Also, when restarting the node, the log file states that node.ml=true.


(David Roberts) #6

So it’s a single node cluster?


(David Roberts) #7

Sorry, I only saw your second reply.

The temp problem is this: https://github.com/elastic/elasticsearch/issues/31732

It can cause the original error you posted. Did you change the temp directory on both nodes? If you only changed it on one node then try changing it on the second one as well and see if that solves the ML job startup problem.

The original error you posted was a remote transport exception, which means the underlying problem was on the other node. So I’m pretty sure there will be an exception in the log on that other node. But probably the exception is the missing temp directory, and you’re not connecting it with the failure to open the job. If that’s the case then explicitly setting the temp directory on both nodes will solve it.
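
As a quick sanity check to run on each node, here is a minimal Python sketch that tests whether the configured temp directory exists and is writable. It assumes the directory is passed to Elasticsearch through the ES_TMPDIR environment variable and falls back to the system default otherwise. If you set the temp directory some other way, pass that path in directly. The helper name check_tmp_dir is made up for this example.

```python
import os
import tempfile


def check_tmp_dir(path):
    """Return (ok, message) saying whether `path` exists and is writable."""
    if not os.path.isdir(path):
        return False, "missing: " + path
    try:
        # Creating and discarding a scratch file proves write access.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, "not writable: %s (%s)" % (path, exc)
    return True, "ok: " + path


if __name__ == "__main__":
    # Assumption: ES_TMPDIR holds the temp directory configured for this node.
    tmp_dir = os.environ.get("ES_TMPDIR", tempfile.gettempdir())
    print(check_tmp_dir(tmp_dir))
```

Run it as the same user that runs Elasticsearch, on both nodes. A "missing" result on either node would match the failure described above.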


(Alon Goldstein) #8

Works!!!

Thanks a lot!

