Could not open job because no ML nodes with sufficient capacity were found

machine-learning

(seshachalam) #1

I have a dedicated ML node with 64 GB of memory and 10 CPUs. I am using X-Pack 6.4 with a small index (900 MB of data).
Right now I am able to run 199 jobs, but I need to run more, and nearly 40 GB of RAM is still free. I initially gave the JVM a 24 GB heap and then raised it to 42 GB, but the error below still occurred.

Configuration

node.ml: true
node.master: false
node.data: false
xpack.ml.max_open_jobs: 250
xpack.ml.max_machine_memory_percent: 90
xpack.ml.node_concurrent_job_allocations: 50

Error: "{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\"}],\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [test_new200] on node [esmaster], because this node isn't a ml node.|Not opening job [test_new200] on node [{esmlnode1}{ml.machine_memory=67564761088}{ml.max_open_jobs=250}{ml.enabled=true}], because this node has insufficient available memory. Available memory for ML [20269428326], memory required by existing jobs [20169725860], estimated memory required for this job [115343360]|Not opening job [test_new200] on node [esnode1], because this node isn't a ml node.]\"}},\"status\":429}"
   
    at Array.forEach (<anonymous>)

(David Kyle) #2

Hi,

199 jobs, wow! That's more than I've ever run. Good work.

The memory used by machine learning jobs does not come from the JVM heap. The jobs run as separate processes called autodetect; assuming you are using Linux, you should see 199 of these processes if you run ps. The memory ML jobs can use comes from what is left after the JVM and OS take their share.

Elasticsearch recommends setting the JVM heap <32GB as documented here (search for compressed oops). Taking memory away from the JVM will be beneficial in this case.

From the error message:

ml.machine_memory = 67564761088
Available memory for ML = 20269428326
memory required by existing jobs = 20169725860

The machine has approximately 67GB of memory, and around 20GB of that is reported as being available to ML. This is determined by the xpack.ml.max_machine_memory_percent setting.

Your 199 jobs are already using close to the 20GB available memory so ml won't open the 200th job as not enough memory is available.
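The figures in the error message can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python, not how Elasticsearch decides allocation internally. The function name `ml_headroom` is made up for this example, the byte counts are copied from the error message, and 30 is used as the percentage because it is the value that reproduces the reported "Available memory for ML" figure.

```python
def ml_headroom(machine_memory, max_percent, used_by_jobs):
    """Return (memory available to ML, memory still free for new jobs), in bytes."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    available = machine_memory * max_percent // 100
    return available, available - used_by_jobs


MACHINE_MEMORY = 67564761088    # ml.machine_memory from the error message
USED_BY_JOBS = 20169725860      # memory required by existing jobs
NEW_JOB_ESTIMATE = 115343360    # estimated memory required for this job

available, free = ml_headroom(MACHINE_MEMORY, 30, USED_BY_JOBS)
print(available)                 # 20269428326, matching "Available memory for ML"
print(free)                      # 99702466
print(free >= NEW_JOB_ESTIMATE)  # False: the 200th job does not fit
```

The remaining headroom (about 99.7 million bytes) is smaller than the estimate for the new job (about 115.3 million bytes), which is why the job is rejected.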

20GB is around 30% of 67GB, and 30% is the default value of xpack.ml.max_machine_memory_percent, so I think the change you made to the configuration file has not been picked up. You should restart the ML node or update the xpack.ml.max_machine_memory_percent setting using the cluster update settings API.

Something like this but I haven't tested it.

PUT /_cluster/settings
{
    "transient" : {
        "xpack.ml.max_machine_memory_percent" : 40
    }
}

I would not give ML 90% of the machine's memory, as that only leaves 6.6GB for the JVM and OS. You will need to experiment with the settings to see what works best for you. Maybe a 16 or 20GB JVM heap is sufficient, in which case you can give a higher percentage to ML. 40% should work for 200 jobs.

xpack.ml.node_concurrent_job_allocations: 50

This controls how many jobs can open at once. Opening many jobs at the same time is a drain on resources, and opening 50 together will exacerbate that problem. Consider whether you need to do this.


(seshachalam) #3

Thanks dkyle,

After changing max_machine_memory_percent to 50, I am able to create 249 jobs.
But why was the configuration I set in elasticsearch.yml not applied?

PUT /_cluster/settings
{
    "transient" : {
        "xpack.ml.max_machine_memory_percent" : 50
    }
}

I have now restarted Elasticsearch on the ML node and applied this transient setting using the cluster settings API, but I am getting the error below.

Error: "{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\"}],\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [test_new10] on node [esmaster], because this node isn't a ml node.]\"}},\"status\":429}"

/var/log/elasticsearch/elastisearch.log
[2018-09-02T16:49:22,716][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esmlnode1] fatal error in thread [elasticsearch[esmlnode1][ml_utility][T#2301]], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_181]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Does a dedicated ML node need to be given more JVM heap memory?


(David Kyle) #4

But why was the configuration I set in elasticsearch.yml not applied?

This is a cluster-wide setting, so if you configure it in elasticsearch.yml it's best to add the value to the file on every node and restart. The simplest solution is to use the cluster settings endpoint, as you did. Note that the setting is transient, so it will not survive a full cluster restart; use persistent if you want it to last.
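As a rough illustration of the transient versus persistent distinction, the Python sketch below builds the request body for the cluster update settings API under either scope and prints it in the same form as the console request used earlier in this thread. The helper name `ml_memory_settings_body` is made up for this example, and like the original request it has not been tested against a live cluster.

```python
import json


def ml_memory_settings_body(percent, persistent=True):
    """Build a cluster-settings body for xpack.ml.max_machine_memory_percent."""
    # "persistent" settings are kept across a full cluster restart,
    # "transient" settings are not.
    scope = "persistent" if persistent else "transient"
    return {scope: {"xpack.ml.max_machine_memory_percent": percent}}


print("PUT /_cluster/settings")
print(json.dumps(ml_memory_settings_body(50), indent=4))
```

Passing `persistent=False` produces the same body as the transient request shown earlier.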

java.lang.OutOfMemoryError: unable to create new native thread

It looks like the ML node JVM could not allocate any more memory, and this caused the JVM to stop. You may be at the limit of the available memory on the machine.

How big are your ML jobs? Look at how much memory the autodetect processes I mentioned earlier are using, and try to find the optimal balance of JVM heap memory, ML memory and OS memory.


(seshachalam) #5

Hi @dkyle,
I am creating very simple single metric jobs.
Why do the autodetect jobs need more JVM heap memory? They run outside the JVM, right?


(rich collier) #6

@sesha - You have misunderstood @dkyle - he means that there are 3 fundamental things competing for the overall memory of the system

  1. the O/S itself
  2. the JVM
  3. all of the autodetect processes that are running (1 process per ML job)

So, he's saying keep an eye on this.

Is there a reason why your jobs are all single metric? Can you not take advantage of doing multi-metric jobs, splitting the data along a categorical field?


(seshachalam) #7

We are evaluating X-Pack and would like to determine the number of single metric jobs that can be created and run on a single node (at most, we aim to create 1000 ML jobs on a single node).

Please clarify the various configurations that need to be set or added to achieve this.

Currently, with 128 GB of RAM, 31 GB of Java heap memory and the configuration below, we were able to create 300 ML jobs.
LimitNOFILE=65536
LimitNPROC=16384

Thanks for your help,
Sesha


(rich collier) #8

I think the best way to "push the limit" as to the number of jobs possible for a node would be to do the following:

  1. reduce the JVM heap size to something much, much smaller, like 4GB, or even 2GB. A dedicated ML node doesn't have the same JVM workload as a normal elasticsearch node (in terms of indexing, searching, etc.) so the demands on the JVM are less on an ML node.
  2. increase the following setting to its maximum value (of 90) on the ML node:
    xpack.ml.max_machine_memory_percent: 90 (see docs)

You should be able to get a lot more jobs on that node now. I'm curious as to what you can get, so keep us posted.
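To get a feel for what this could mean in numbers, here is a back-of-the-envelope estimate in Python. It is a sketch under stated assumptions, not a prediction: the helper `estimated_job_capacity` is made up for this example, "128 GB" is assumed to mean 128 GiB, and the per-job figure is the 115343360-byte estimate from the error message earlier in the thread. It also ignores other limits such as xpack.ml.max_open_jobs and thread or process limits on the machine.

```python
def estimated_job_capacity(machine_memory, max_percent, per_job_memory):
    """Upper bound on jobs that fit in the ML share of machine memory."""
    ml_budget = machine_memory * max_percent // 100  # bytes ML is allowed to use
    return ml_budget // per_job_memory


GIB = 1024 ** 3
machine = 128 * GIB    # assumption: "128 GB" means 128 GiB
per_job = 115343360    # per-job estimate from the earlier error message (110 MiB)

print(estimated_job_capacity(machine, 90, per_job))  # 1072
print(estimated_job_capacity(machine, 30, per_job))  # 357
```

On these assumptions the memory limit alone would allow roughly three times as many jobs at 90% as at the default 30%; what the node achieves in practice depends on the actual size of the jobs and the other limits mentioned above.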


(seshachalam) #9

HI @richcollier

I am getting the error below when the Java thread count reaches 4717. I have set both LimitNPROC and LimitNOFILE to infinity.

java.lang.OutOfMemoryError: unable to create new native thread

When I create, start, or delete an ML job, threads are created on the ML node, but they are not released after each task completes, so the total thread count keeps increasing.


(rich collier) #10

What client are you using to invoke the job creation/start/delete API?

What version of ML are you using?


(seshachalam) #11

X-pack 6.4
I am creating jobs from Kibana UI


(David Roberts) #12

Which Linux distribution are you using?

If it's a recent version of Ubuntu or SLES then you may benefit from adding TasksMax=10000 to your service file adjacent to where you have LimitNPROC=infinity.

If it's RHEL or CentOS 7 then adding the TasksMax setting will be a syntax error, so don't.

More details in https://www.elastic.co/blog/we-are-out-of-memory-systemd-process-limits


(seshachalam) #13

Linux version 4.9.0-8-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.110-3+deb9u3 (2018-08-19)

Why does the thread count not go down after an ML job operation (deletion, creation) completes?


(rich collier) #14

Sesha - before moving on to further questions, please confirm whether or not you've tried the TasksMax setting as recommended above.


(seshachalam) #15

Thanks @droberts195

Now I am able to create 700 jobs, but after that I get the error below:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@4211afeb on QueueResizingEsThreadPoolExecutor


(David Roberts) #16

So just to confirm, did you set TasksMax to a high number to get to 700 jobs?

For the new exception please could you paste the full stack trace, plus any lines immediately before or after that name a specific threadpool? I need to know which threadpool it relates to to understand why it happened.

Finally, for your question about why the number of threads used by ML never goes down, it’s because we’re using fixed size threadpools rather than scaling threadpools. The threads get created lazily, but don’t get stopped until the node is shut down. We have a plan to improve that for 7.0 - see https://github.com/elastic/elasticsearch/issues/29809
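The behaviour described here (threads created lazily, then kept until shutdown) can be seen in miniature with Python's standard thread pool, which happens to work in a similar way. This is only an analogy to make the pattern visible, not Elasticsearch's Java implementation, and the pool size of 4 is arbitrary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4
baseline = threading.active_count()

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
after_create = threading.active_count() - baseline    # no worker threads yet

# The barrier makes all four tasks run at the same time,
# so the pool has to start all four worker threads.
barrier = threading.Barrier(POOL_SIZE, timeout=10)
futures = [pool.submit(barrier.wait) for _ in range(POOL_SIZE)]
for future in futures:
    future.result()
after_work = threading.active_count() - baseline      # tasks done, threads still alive

pool.shutdown(wait=True)
after_shutdown = threading.active_count() - baseline  # threads end only at shutdown

print(after_create, after_work, after_shutdown)       # 0 4 0
```

The middle number is the point: once the work is finished the threads stay around, just as the thread count on the ML node stays high after jobs have been created or deleted.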

