Machine learning error: persistent task is awaiting node assignment


I hope this message finds the Elastic community and their loved ones safe and healthy.

I have a three-node cluster:

  1. Node-1 and Node-2 hold data and carry out computation:
  • both have 24 GB of RAM with the JVM heap set to 12 GB
  • both have 3 vCPUs at 3.4 GHz per vCPU
  2. All of the nodes run on separate SSDs, and IOPS stay under 50% even at peak load.
  3. Average CPU usage for node-1 and node-2 is ~50% over the last 7 days.
  4. Average memory utilisation is ~30% over the last 7 days (JVM heap is at ~55%).
  5. Node-3 is a voting-only node.

For all of my ML jobs I get "persistent task is awaiting node assignment". Do I need to spin up a dedicated ML node, or can I assign the jobs to the current nodes?

Which version are you running?

Are node-1 and node-2 general-purpose nodes with all roles? In other words, you haven't put node.ml: false or explicitly specified node.roles in either of their elasticsearch.yml files?

If those nodes have all the roles, including the ML role, then you don't necessarily have to spin up a dedicated node.

I am wondering if you are suffering from the poor communication outlined in [ML] Make the job notifications related to node assignment more user friendly · Issue #79270 · elastic/elasticsearch · GitHub. Often when ML jobs cannot be assigned, the reason is transitory and they do get assigned quickly afterwards. But we show warnings for 24 hours, which makes people think there's a permanent problem. Where exactly are you seeing "persistent task is awaiting node assignment"? And are any of your jobs running?

If some jobs are running but not others after you restarted your cluster then it could be that the cluster was big enough to run all the jobs originally but isn't now after the jobs saw a lot of data to model and increased in size. If that's what's happened it will be possible to tell from the output of the Get anomaly detection job statistics API.
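As a rough sketch of that check, the snippet below scans an already-parsed response from the Get anomaly detection job statistics API (GET _ml/anomaly_detectors/_stats) and lists the jobs whose memory_status is anything other than ok. The function name jobs_over_memory and the second sample job are hypothetical; the field names are the ones that appear in the stats output, and the sample dict is trimmed to the fields the check needs.

```python
from typing import Any, Dict, List, Tuple


def jobs_over_memory(stats: Dict[str, Any]) -> List[Tuple[str, str, int, int]]:
    """Return (job_id, memory_status, model_bytes, limit_bytes) for jobs not 'ok'."""
    flagged = []
    for job in stats.get("jobs", []):
        size = job.get("model_size_stats", {})
        status = size.get("memory_status", "ok")
        if status != "ok":  # soft_limit or hard_limit
            flagged.append((
                job["job_id"],
                status,
                size.get("model_bytes", 0),
                size.get("model_bytes_memory_limit", 0),
            ))
    return flagged


# Trimmed sample: the second job is a made-up healthy job for contrast.
sample = {
    "count": 2,
    "jobs": [
        {
            "job_id": "dns_rare_oxford",
            "model_size_stats": {
                "model_bytes": 44051824,
                "model_bytes_memory_limit": 44040192,
                "memory_status": "hard_limit",
            },
        },
        {
            "job_id": "example_ok_job",
            "model_size_stats": {
                "model_bytes": 1000000,
                "model_bytes_memory_limit": 44040192,
                "memory_status": "ok",
            },
        },
    ],
}

for job_id, status, used, limit in jobs_over_memory(sample):
    print(f"{job_id}: {status} ({used} bytes used, limit {limit} bytes)")
```

A job reported here with model_bytes at or above its limit is one whose model has outgrown its configured memory.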

Hello @droberts195, thank you very much for your reply.

  1. I am running Elastic 7.15.2 with a Platinum trial license.
  2. For nodes 1 & 2 the following options are set:
node.master: true
node.voting_only: false
  3. When I create custom ML jobs they do run and finish, but I can't seem to keep them always "on". Here is a snapshot of the ML page, where I see a different error too. ML has 2 nodes as per the status page.

  4. When I start prebuilt jobs, such as those that are part of Elastic Security, they keep shutting down abruptly. I've started the jobs again and will check how they continue:

  5. Some custom jobs end with "hard limit" in the memory status. Here is the output of their stats. What do I need to change here?
{
  "count" : 1,
  "jobs" : [
    {
      "job_id" : "dns_rare_oxford",
      "data_counts" : {
        "job_id" : "dns_rare_oxford",
        "processed_record_count" : 486569375,
        "processed_field_count" : 486569375,
        "input_bytes" : 41188106693,
        "input_field_count" : 486569375,
        "invalid_date_count" : 0,
        "missing_field_count" : 0,
        "out_of_order_timestamp_count" : 0,
        "empty_bucket_count" : 0,
        "sparse_bucket_count" : 8,
        "bucket_count" : 18,
        "earliest_record_timestamp" : 1632442643000,
        "latest_record_timestamp" : 1632508200000,
        "last_data_time" : 1635241410029,
        "latest_sparse_bucket_timestamp" : 1632484800000,
        "input_record_count" : 486569375,
        "log_time" : 1635241410029
      },
      "model_size_stats" : {
        "job_id" : "dns_rare_oxford",
        "result_type" : "model_size_stats",
        "model_bytes" : 44051824,
        "peak_model_bytes" : 44051824,
        "model_bytes_exceeded" : 20890,
        "model_bytes_memory_limit" : 44040192,
        "total_by_field_count" : 11732,
        "total_over_field_count" : 0,
        "total_partition_field_count" : 2,
        "bucket_allocation_failures_count" : 18,
        "memory_status" : "hard_limit",
        "assignment_memory_basis" : "model_memory_limit",
        "categorized_doc_count" : 0,
        "total_category_count" : 0,
        "frequent_category_count" : 0,
        "rare_category_count" : 0,
        "dead_category_count" : 0,
        "failed_category_count" : 0,
        "categorization_status" : "ok",
        "log_time" : 1635241417030,
        "timestamp" : 1632502800000
      },
      "forecasts_stats" : {
        "total" : 0,
        "forecasted_jobs" : 0
      },
      "state" : "closed",
      "timing_stats" : {
        "job_id" : "dns_rare_oxford",
        "bucket_count" : 18,
        "total_bucket_processing_time_ms" : 316.0,
        "minimum_bucket_processing_time_ms" : 4.0,
        "maximum_bucket_processing_time_ms" : 104.0,
        "average_bucket_processing_time_ms" : 17.555555555555557,
        "exponential_average_bucket_processing_time_ms" : 6.193679389034066,
        "exponential_average_bucket_processing_time_per_hour_ms" : 28.0
      }
    }
  ]
}

Please do let me know if more information is required.

  1. The default for node.ml is true, so nodes 1 and 2 are ML nodes as well as master nodes. So there's nothing in the elasticsearch.yml settings that would cause a problem for what you are doing.
  2. The jobs in the screenshot are in the failed state. Something is going wrong with them. The clue will be in the log files of the node they were running on at the time when they failed. If you search for the word failed in your logs then the other messages around that area will probably indicate why one of the jobs failed. If it's too low level to understand and doesn't contain any confidential information you could paste an extract of the log around the time of an ML job failure in this thread and I will have a look.
  3. The "5 jobs unavailable" in the security app screenshot corresponds to the 5 failed jobs in the ML app screenshot. It's the same problem. We need to work out what's making them fail.
  4. For the jobs that went to hard_limit, you would need to increase the model_memory_limit setting inside analysis_limits. Search for model_memory_limit in Update anomaly detection jobs API | Elasticsearch Guide [7.15] | Elastic for more information.
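To make the last point concrete, here is a minimal sketch that builds the request for the Update anomaly detection jobs API from a job's peak_model_bytes. The helper name build_limit_update and the 2x headroom multiplier are assumptions made for illustration, not an Elastic recommendation. A job that hit hard_limit stopped modelling some data, so its reported peak likely understates what it really needs, and the new limit also has to fit within the memory available to ML on the node.

```python
import json
import math
from typing import Tuple


def build_limit_update(job_id: str, peak_model_bytes: int,
                       headroom: float = 2.0) -> Tuple[str, str]:
    """Return (request_line, json_body) for raising a job's model_memory_limit.

    headroom is an arbitrary illustrative multiplier (assumption).
    """
    # Round up to whole megabytes so the new limit is never below the target.
    new_limit_mb = math.ceil(peak_model_bytes * headroom / (1024 * 1024))
    body = {"analysis_limits": {"model_memory_limit": f"{new_limit_mb}mb"}}
    return f"POST _ml/anomaly_detectors/{job_id}/_update", json.dumps(body)


# peak_model_bytes taken from the dns_rare_oxford stats posted in this thread.
request_line, body = build_limit_update("dns_rare_oxford", 44051824)
print(request_line)
print(body)
```

The two printed lines can be pasted into Kibana Dev Tools. Because bucket results produced while the job was at hard_limit are incomplete, re-running the job over the affected time range after raising the limit may be worthwhile.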