Machine learning error: persistent task is awaiting node assignment

parthmaniar · November 24, 2021, 10:57am

Hello,

I hope this message finds the Elastic community and their loved ones safe and healthy.

I have a three node cluster:

Node-1 and Node-2 hold data & carry out computation:

Both have 24 GB of RAM with the JVM heap set to 12 GB
3 vCPUs @ 3.4 GHz per vCPU each.

All of the nodes are running on separate SSDs, and IOPS are under 50% even at peak load.
Average CPU usage for node-1 and node-2 is ~50% for the last 7 days.
Average Memory utilisation is ~30% for the last 7 days. (JVM heap is around ~55%)
Node-3 is a voting only node.

For all of my ML jobs I get "persistent task is awaiting node assignment". Do I need to spin up a dedicated ML node, or can I assign the jobs to the current nodes?

droberts195 · November 24, 2021, 11:33am

Which version are you running?

Are node-1 and node-2 general purpose nodes with all roles? In other words, you haven't put node.ml: false or explicitly specified node.roles in either of their elasticsearch.yml files?

If those nodes have all the roles, including the ML role then you don't necessarily have to spin up a dedicated node.
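One way to confirm which nodes carry the ML role is to look at the roles each node reports in the nodes info API (GET _nodes). Here is a minimal Python sketch, assuming the response has already been fetched and parsed into a dict. The function name and the sample response (node IDs, names and role lists) are hypothetical and trimmed to the fields used:

```python
def ml_capable_nodes(nodes_info):
    """Return the sorted names of nodes whose reported roles include "ml"."""
    return sorted(
        node["name"]
        for node in nodes_info.get("nodes", {}).values()
        if "ml" in node.get("roles", [])
    )


# Hypothetical, trimmed GET _nodes response for a three-node cluster
# like the one described in this thread.
sample = {
    "nodes": {
        "id-1": {"name": "node-1", "roles": ["data", "ingest", "master", "ml"]},
        "id-2": {"name": "node-2", "roles": ["data", "ingest", "master", "ml"]},
        "id-3": {"name": "node-3", "roles": ["master", "voting_only"]},
    }
}

print(ml_capable_nodes(sample))
```

If a node you expect to run ML jobs is missing from the result, its elasticsearch.yml is the first place to look.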

I am wondering if you are suffering from the poor communication outlined in elastic/elasticsearch issue #79270, "[ML] Make the job notifications related to node assignment more user friendly". Often when ML jobs cannot be assigned the reason is transitory and they do get assigned quickly afterwards. But we show warnings for 24 hours, which makes people think there's a permanent problem. Where exactly are you seeing "persistent task is awaiting node assignment"? And are any of your jobs running?

If some jobs are running but not others after you restarted your cluster then it could be that the cluster was big enough to run all the jobs originally but isn't now after the jobs saw a lot of data to model and increased in size. If that's what's happened it will be possible to tell from the output of the Get anomaly detection job statistics API.
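As a rough sketch of what to look for in that output, the function below takes a parsed response from GET _ml/anomaly_detectors/_stats and lists jobs that have failed, have a memory status other than "ok", or look unassigned. The function name and the sample data are illustrative. Treating an opening or opened job with no node object as unassigned is an assumption of this sketch, not something the API states directly.

```python
def triage_jobs(stats):
    """Return (job_id, reasons) pairs for jobs that look unhealthy."""
    report = []
    for job in stats.get("jobs", []):
        reasons = []
        state = job.get("state")
        if state == "failed":
            reasons.append("failed")
        memory_status = job.get("model_size_stats", {}).get("memory_status", "ok")
        if memory_status != "ok":
            reasons.append("memory_status=" + memory_status)
        # Assumption: a job that should be running but reports no node
        # is still waiting to be assigned.
        if state in ("opening", "opened") and "node" not in job:
            reasons.append("unassigned: " + job.get("assignment_explanation", ""))
        if reasons:
            report.append((job["job_id"], reasons))
    return report


# Hypothetical, heavily trimmed stats response used only for illustration.
sample_stats = {
    "jobs": [
        {"job_id": "example_ok_job", "state": "opened", "node": {"name": "node-1"},
         "model_size_stats": {"memory_status": "ok"}},
        {"job_id": "example_limited_job", "state": "closed",
         "model_size_stats": {"memory_status": "hard_limit"}},
        {"job_id": "example_waiting_job", "state": "opening",
         "assignment_explanation": "persistent task is awaiting node assignment."},
    ]
}

for job_id, reasons in triage_jobs(sample_stats):
    print(job_id, reasons)
```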

parthmaniar · November 25, 2021, 4:33am

Hello @droberts195, thank you very much for your reply.

Edit: I am running Elastic 7.15.2 with a Platinum trial license.

For nodes 1 & 2 the following options are set:

node.master: true
node.voting_only: false

When I create custom ML jobs they do run and finish, but I can't seem to keep them always "on". Here is a snapshot of the ML page, where I see a different error too; ML has 2 nodes as per the status page.

When I start prebuilt jobs, such as those that are part of Elastic Security, they keep shutting down abruptly. I've started the jobs again and will check whether they keep running:

Some custom jobs end with a memory status of "hard_limit". Here is the output of their status. What do I need to change here?

{
  "count" : 1,
  "jobs" : [
    {
      "job_id" : "dns_rare_oxford",
      "data_counts" : {
        "job_id" : "dns_rare_oxford",
        "processed_record_count" : 486569375,
        "processed_field_count" : 486569375,
        "input_bytes" : 41188106693,
        "input_field_count" : 486569375,
        "invalid_date_count" : 0,
        "missing_field_count" : 0,
        "out_of_order_timestamp_count" : 0,
        "empty_bucket_count" : 0,
        "sparse_bucket_count" : 8,
        "bucket_count" : 18,
        "earliest_record_timestamp" : 1632442643000,
        "latest_record_timestamp" : 1632508200000,
        "last_data_time" : 1635241410029,
        "latest_sparse_bucket_timestamp" : 1632484800000,
        "input_record_count" : 486569375,
        "log_time" : 1635241410029
      },
      "model_size_stats" : {
        "job_id" : "dns_rare_oxford",
        "result_type" : "model_size_stats",
        "model_bytes" : 44051824,
        "peak_model_bytes" : 44051824,
        "model_bytes_exceeded" : 20890,
        "model_bytes_memory_limit" : 44040192,
        "total_by_field_count" : 11732,
        "total_over_field_count" : 0,
        "total_partition_field_count" : 2,
        "bucket_allocation_failures_count" : 18,
        "memory_status" : "hard_limit",
        "assignment_memory_basis" : "model_memory_limit",
        "categorized_doc_count" : 0,
        "total_category_count" : 0,
        "frequent_category_count" : 0,
        "rare_category_count" : 0,
        "dead_category_count" : 0,
        "failed_category_count" : 0,
        "categorization_status" : "ok",
        "log_time" : 1635241417030,
        "timestamp" : 1632502800000
      },
      "forecasts_stats" : {
        "total" : 0,
        "forecasted_jobs" : 0
      },
      "state" : "closed",
      "timing_stats" : {
        "job_id" : "dns_rare_oxford",
        "bucket_count" : 18,
        "total_bucket_processing_time_ms" : 316.0,
        "minimum_bucket_processing_time_ms" : 4.0,
        "maximum_bucket_processing_time_ms" : 104.0,
        "average_bucket_processing_time_ms" : 17.555555555555557,
        "exponential_average_bucket_processing_time_ms" : 6.193679389034066,
        "exponential_average_bucket_processing_time_per_hour_ms" : 28.0
      }
    }
  ]
}

Please do let me know if more information is required.

droberts195 · November 25, 2021, 9:54am

The default for node.ml is true, so nodes 1 and 2 are ML nodes as well as master nodes. So there's nothing in the elasticsearch.yml settings that would cause a problem for what you are doing.
The jobs in the screenshot are in the failed state. Something is going wrong with them. The clue will be in the log files of the node they were running on at the time when they failed. If you search for the word failed in your logs then the other messages around that area will probably indicate why one of the jobs failed. If it's too low level to understand and doesn't contain any confidential information you could paste an extract of the log around the time of an ML job failure in this thread and I will have a look.
The "5 jobs unavailable" in the security app screenshot corresponds to the 5 failed jobs in the ML app screenshot. It's the same problem. We need to work out what's making them fail.
For the jobs that went to hard_limit, you would need to increase analysis_limits.model_memory_limit. Search for model_memory_limit in the "Update anomaly detection jobs API" page of the Elasticsearch Guide [7.15] for more information.
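To illustrate the last point, here is a minimal Python sketch that builds the path and body for the update anomaly detection jobs API, using the model_bytes_memory_limit reported for dns_rare_oxford above (44040192 bytes, which is 42 MiB). The helper name and the doubling factor are assumptions for illustration only. The right limit depends on how much the model needs to grow, and the sketch only builds the request without sending it.

```python
import json

MIB = 1024 * 1024


def build_limit_update(job_id, current_limit_bytes, factor=2):
    """Build the path and JSON body for raising a job's model_memory_limit.

    `factor` is an arbitrary illustrative multiplier, not a recommendation.
    """
    current_mb = current_limit_bytes // MIB
    new_mb = current_mb * factor
    path = "/_ml/anomaly_detectors/{}/_update".format(job_id)
    body = {"analysis_limits": {"model_memory_limit": "{}mb".format(new_mb)}}
    return path, json.dumps(body)


# Values taken from the job statistics posted earlier in this thread.
path, body = build_limit_update("dns_rare_oxford", 44040192)
print("POST", path)
print(body)
```

The printed path and body can be pasted into Kibana Dev Tools as a POST request. The job in the statistics above is already closed, so the new limit would apply the next time it is opened.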

system · December 23, 2021, 9:54am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
