Elastic cloud basic setup: Could not open job because no ML nodes with sufficient capacity were found

Hello,

I am running an Elastic Cloud cluster, with 1GB for the ML node (the default).

I have one job running OK, and when I try to run a second one, I get the error below.

How can I know the capacity of the ML node? When I add jobs, how can I know the remaining capacity?

Thx

{
  "changed": false,
  "connection": "Close",
  "content": "{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\"}],\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000042], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [tiebreaker-0000000044], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [{instance-0000000045}{ml.machine_memory=1073741824}{ml.max_open_jobs=20}{ml.enabled=true}], because this node has insufficient available memory. Available memory for ML [440234147], memory required by existing jobs [108105012], estimated memory required for this job [435159040]|Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000043], because this node isn't a ml node.]\"}},\"status\":429}",
  "content_length": "1073",
  "content_type": "application/json; charset=UTF-8",
  "date": "Mon, 25 Feb 2019 15:22:19 GMT",
  "json": {
    "error": {
      "caused_by": {
        "reason": "Could not open job because no suitable nodes were found, allocation explanation [Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000042], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [tiebreaker-0000000044], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [{instance-0000000045}{ml.machine_memory=1073741824}{ml.max_open_jobs=20}{ml.enabled=true}], because this node has insufficient available memory. Available memory for ML [440234147], memory required by existing jobs [108105012], estimated memory required for this job [435159040]|Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000043], because this node isn't a ml node.]",
        "type": "illegal_state_exception"
      },
      "reason": "Could not open job because no ML nodes with sufficient capacity were found",
      "root_cause": [{
        "reason": "Could not open job because no ML nodes with sufficient capacity were found",
        "type": "status_exception"
      }],
      "type": "status_exception"
    },
    "status": 429
  },
  "msg": "Status code was 429 and not [201, 200]: HTTP Error 429: Too Many Requests",
  "redirected": false,
  "server": "fp/4xxxxx",
  "status": 429,
  "url": "https://xxxxx.eu-west-1.aws.found.io:9243/_xpack/ml/anomaly_detectors/job-rundown_time_aps5000_4/_open",
  "x_found_handling_cluster": "xxxxx",
  "x_found_handling_instance": "instance-0000000042",
  "x_found_handling_server": "xxxxx"
}

Currently, you have a couple of options:

  1. Pre-calculate the current memory usage of ML on the node and then estimate the headroom you have for a new job, or
  2. Do what you did: attempt to open a new job and let the node tell you that it doesn't have enough room for it (see the sketch after this list).
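
As a rough illustration of option 2, here is a minimal sketch that attempts to open a job and prints the node's allocation explanation when it is refused with a 429, as in the output above. It assumes a 6.x cluster (hence the _xpack/ml path); the cluster URL is the redacted one from the output, the credentials are hypothetical placeholders.

# Minimal sketch: try to open an ML job and surface the allocation explanation
# when the node refuses for lack of memory (HTTP 429, as in the response above).
import requests

ES_URL = "https://xxxxx.eu-west-1.aws.found.io:9243"   # redacted cluster URL from the output above
AUTH = ("elastic", "changeme")                         # hypothetical credentials
JOB_ID = "job-rundown_time_aps5000_4"

resp = requests.post(f"{ES_URL}/_xpack/ml/anomaly_detectors/{JOB_ID}/_open", auth=AUTH)

if resp.status_code == 429:
    # caused_by.reason carries the per-node explanation, including the
    # "Available memory for ML" and "estimated memory required" figures.
    print(resp.json()["error"]["caused_by"]["reason"])
else:
    resp.raise_for_status()
    print(f"Job {JOB_ID} opened")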

The first obviously requires some intimate knowledge of how memory is allocated by ML on the node. Namely:

  • memory is only used by jobs that are in the open state
  • each job has approximately 100MB of overhead
  • each job has an additional "model memory" that is shown as model_bytes in the Counts tab of the Job Management page or via the job stats API. This value is approximately 20kB to 30kB for every unique time series in the model (the number of splits/partitions)
  • the ML processes do not have access to the entire memory space of the node, but rather approximately 30% of it (this is controlled via a node setting called xpack.ml.max_machine_memory_percent). This value is a little more complicated to determine in Cloud because Cloud uses containers. I've seen this setting described as:

xpack.ml.max_machine_memory_percent: min(max_machine_memory, round((container_size - JVM_heap_size - non_jvm_process_overhead) / container_size * 100))
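
To make that concrete, here is a small sketch that estimates the remaining headroom from the figures the node itself reports. The numbers below are copied straight from the allocation explanation in the error above; in practice you would build the "used by open jobs" figure yourself from each open job's model_bytes (via the job stats API mentioned above) plus the ~100MB per-job overhead.

# Rough headroom check using the figures the node reports in the error above
# (all values in bytes, copied from the allocation explanation).
available_for_ml = 440_234_147    # "Available memory for ML"
used_by_open_jobs = 108_105_012   # "memory required by existing jobs"
new_job_estimate = 435_159_040    # "estimated memory required for this job"

headroom = available_for_ml - used_by_open_jobs
print(f"Headroom: {headroom / 1024**2:.0f} MB, "
      f"new job needs {new_job_estimate / 1024**2:.0f} MB")
# ~317 MB of headroom vs ~415 MB required, hence the rejection on this 1GB node.

The "Available memory for ML" figure appears to already reflect the xpack.ml.max_machine_memory_percent calculation above applied to the node's ml.machine_memory of 1073741824 bytes.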

The bottom line is that it is a little complicated - and we don't yet have an easier way to figure out your future sizing needs for ML, especially in Cloud. But keep in mind that the 1GB node on Cloud is free and is really meant for people to "try out" ML. It is not expected that you'd run many production-worthy, high-availability ML jobs on a single 1GB node :wink:

Some more information can be found on these blogs:

