Elastic cloud basic setup: Could not open job because no ML nodes with sufficient capacity were found

Hello,

I am running an Elastic Cloud cluster, with 1GB for the ML node (the default).

I have one job running OK, and when I try to run a second one, I get the error below.

How can I know the capacity of the ML node? When I add jobs, how can I know the remaining capacity?

Thx

{
  "changed": false,
  "connection": "Close",
  "content": "{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\"}],\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000042], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [tiebreaker-0000000044], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [{instance-0000000045}{ml.machine_memory=1073741824}{ml.max_open_jobs=20}{ml.enabled=true}], because this node has insufficient available memory. Available memory for ML [440234147], memory required by existing jobs [108105012], estimated memory required for this job [435159040]|Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000043], because this node isn't a ml node.]\"}},\"status\":429}",
  "content_length": "1073",
  "content_type": "application/json; charset=UTF-8",
  "date": "Mon, 25 Feb 2019 15:22:19 GMT",
  "json": {
    "error": {
      "caused_by": {
        "reason": "Could not open job because no suitable nodes were found, allocation explanation [Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000042], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [tiebreaker-0000000044], because this node isn't a ml node.|Not opening job [job-rundown_time_aps5000_4] on node [{instance-0000000045}{ml.machine_memory=1073741824}{ml.max_open_jobs=20}{ml.enabled=true}], because this node has insufficient available memory. Available memory for ML [440234147], memory required by existing jobs [108105012], estimated memory required for this job [435159040]|Not opening job [job-rundown_time_aps5000_4] on node [instance-0000000043], because this node isn't a ml node.]",
        "type": "illegal_state_exception"
      },
      "reason": "Could not open job because no ML nodes with sufficient capacity were found",
      "root_cause": [{
        "reason": "Could not open job because no ML nodes with sufficient capacity were found",
        "type": "status_exception"
      }],
      "type": "status_exception"
    },
    "status": 429
  },
  "msg": "Status code was 429 and not [201, 200]: HTTP Error 429: Too Many Requests",
  "redirected": false,
  "server": "fp/4xxxxx",
  "status": 429,
  "url": "https://xxxxx.eu-west-1.aws.found.io:9243/_xpack/ml/anomaly_detectors/job-rundown_time_aps5000_4/_open",
  "x_found_handling_cluster": "xxxxx",
  "x_found_handling_instance": "instance-0000000042",
  "x_found_handling_server": "xxxxx"
}

Currently, you have a couple of options:

  1. Pre-calculate the current memory usage of ML on the node and then estimate the headroom you have for a new job, or
  2. Do what you did: attempt to open a new job and let the node tell you that it doesn't have enough room for it (see the sketch after this list).
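
As a rough illustration of option 2, here is a minimal sketch that attempts to open a job and prints the node's allocation explanation when it is refused with a 429, as in the output above. It assumes a 6.x cluster (hence the _xpack/ml path); the cluster URL is the redacted one from the output, the credentials are hypothetical placeholders.

# Minimal sketch: try to open an ML job and surface the allocation explanation
# when the node refuses for lack of memory (HTTP 429, as in the response above).
import requests

ES_URL = "https://xxxxx.eu-west-1.aws.found.io:9243"   # redacted cluster URL from the output above
AUTH = ("elastic", "changeme")                         # hypothetical credentials
JOB_ID = "job-rundown_time_aps5000_4"

resp = requests.post(f"{ES_URL}/_xpack/ml/anomaly_detectors/{JOB_ID}/_open", auth=AUTH)

if resp.status_code == 429:
    # caused_by.reason carries the per-node explanation, including the
    # "Available memory for ML" and "estimated memory required" figures.
    print(resp.json()["error"]["caused_by"]["reason"])
else:
    resp.raise_for_status()
    print(f"Job {JOB_ID} opened")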

The first obviously requires some intimate knowledge of how memory is allocated by ML on the node. Namely:

  • memory is only used by jobs that are in the open state
  • each job has approximately 100MB of overhead
  • each job has an additional "model memory" that is shown as model_bytes in the Counts tab of the Job Management page or via the job stats API. This value is approximately 20kB to 30kB for every unique time series in the model (the number of splits/partitions)
  • the ML processes do not have access to the entire memory space of the node, but rather approximately 30% of it (this is controlled via a node setting called xpack.ml.max_machine_memory_percent). This value is a little more complicated to determine in Cloud because Cloud uses containers. I've seen this setting described as:

xpack.ml.max_machine_memory_percent: min(max_machine_memory, round((container_size - JVM_heap_size - non_jvm_process_overhead) / container_size * 100))
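
To make that concrete, here is a small sketch that estimates the remaining headroom from the figures the node itself reports. The numbers below are copied straight from the allocation explanation in the error above; in practice you would build the "used by open jobs" figure yourself from each open job's model_bytes (via the job stats API mentioned above) plus the ~100MB per-job overhead.

# Rough headroom check using the figures the node reports in the error above
# (all values in bytes, copied from the allocation explanation).
available_for_ml = 440_234_147    # "Available memory for ML"
used_by_open_jobs = 108_105_012   # "memory required by existing jobs"
new_job_estimate = 435_159_040    # "estimated memory required for this job"

headroom = available_for_ml - used_by_open_jobs
print(f"Headroom: {headroom / 1024**2:.0f} MB, "
      f"new job needs {new_job_estimate / 1024**2:.0f} MB")
# ~317 MB of headroom vs ~415 MB required, hence the rejection on this 1GB node.

The "Available memory for ML" figure appears to already reflect the xpack.ml.max_machine_memory_percent calculation above applied to the node's ml.machine_memory of 1073741824 bytes.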

The bottom line is that it is a little complicated - and we don't yet have an easier way to figure out your future sizing needs for ML, especially in Cloud. But keep in mind that the 1GB node on Cloud is free and is really meant for people to "try out" ML. It is not expected that you'd run many production-worthy, high-availability ML jobs on a single 1GB node :wink:

Some more information can be found on these blogs:

