ML Job hard limit error on Linux

Hi,

Maybe somebody can help me solve the following problem.

Problem description: if I create and start a new multi-metric job, the memory status always changes to hard_limit. (The same job worked on Windows without any problem.)

OS: Ubuntu 16.04.5 LTS (GNU/Linux 4.15.18-1-pve x86_64)
Memory: 12GB
Elastic: 6.4.1
Java: Oracle 1.8.0_181-b13 (x64)
Heap size: 4GB
vm.max_map_count=262144

Index:
Docs Count: 8431
Storage Size: 2.2mb

Job:
established_model_memory: 143.2 KB
model_memory_limit: I tried different values from 12MB to 1200MB

job error message:

Job memory status changed to hard_limit at 83.7kb; adjust the analysis_limits.model_memory_limit setting to ensure all data is analyzed.

If I create another job on a similar but bigger index (with 3 million documents), then the limit value in the error message is 69mb.

I didn't see any error message in the Elasticsearch log.

The error was reproduced on another Linux machine.
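For reference, the limit can also be changed on an existing job without recreating it. A minimal sketch using the 6.x update API (the job name and value here are just placeholders, not my exact calls; analysis_limits can only be updated while the job is closed):

POST _xpack/ml/anomaly_detectors/yourjobnamehere/_update
{
  "analysis_limits": {
    "model_memory_limit": "1200mb"
  }
}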

Can you provide the following info on your ML Job?

GET _xpack/ml/anomaly_detectors/yourjobnamehere/_stats?pretty

{
  "count": 1,
  "jobs": [
    {
      "job_id": "dev2",
      "data_counts": {
        "job_id": "dev2",
        "processed_record_count": 8431,
        "processed_field_count": 16862,
        "input_bytes": 741326,
        "input_field_count": 16862,
        "invalid_date_count": 0,
        "missing_field_count": 0,
        "out_of_order_timestamp_count": 0,
        "empty_bucket_count": 8,
        "sparse_bucket_count": 1,
        "bucket_count": 431,
        "earliest_record_timestamp": 1199059200000,
        "latest_record_timestamp": 1459900800000,
        "last_data_time": 1538492502740,
        "latest_empty_bucket_timestamp": 1451520000000,
        "latest_sparse_bucket_timestamp": 1388016000000,
        "input_record_count": 8431
      },
      "model_size_stats": {
        "job_id": "dev2",
        "result_type": "model_size_stats",
        "model_bytes": 144616,
        "total_by_field_count": 5,
        "total_over_field_count": 0,
        "total_partition_field_count": 6,
        "bucket_allocation_failures_count": 398,
        "memory_status": "hard_limit",
        "log_time": 1538492507000,
        "timestamp": 1458777600000
      },
      "forecasts_stats": {
        "total": 0,
        "forecasted_jobs": 0
      },
      "state": "closed"
    }
  ]
}

Thanks - sorry to ask again, but I actually wanted the full job details, so can you please run

GET _xpack/ml/anomaly_detectors/dev2?pretty

(that is, without the _stats part)

{
  "count": 1,
  "jobs": [
    {
      "job_id": "dev2",
      "job_type": "anomaly_detector",
      "job_version": "6.4.1",
      "description": "",
      "create_time": 1538492492071,
      "finished_time": 1538492509358,
      "established_model_memory": 144616,
      "analysis_config": {
        "bucket_span": "7d",
        "detectors": [
          {
            "detector_description": "non_null_sum(ve_costamountactual)",
            "function": "non_null_sum",
            "field_name": "ve_costamountactual",
            "partition_field_name": "ve_itemno.keyword",
            "detector_index": 0
          }
        ],
        "influencers": [
          "ve_itemno.keyword"
        ]
      },
      "analysis_limits": {
        "model_memory_limit": "12mb",
        "categorization_examples_limit": 4
      },
      "data_description": {
        "time_field": "ve_postingdate",
        "time_format": "epoch_ms"
      },
      "model_snapshot_retention_days": 1,
      "custom_settings": {
        "created_by": "multi-metric-wizard"
      },
      "model_snapshot_id": "1538492507",
      "results_index_name": "shared"
    }
  ]
}

Thank you - can you also tell me the approximate cardinality of the field ve_itemno.keyword?

GET yourindexname/_search
{
  "size": 0,
  "aggs": {
    "cardinality": {
      "cardinality": {
        "field": "ve_itemno.keyword"
      }
    }
  }
}
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8431,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "cardinality": {
      "value": 5
    }
  }
}

Thanks for supplying the info - I'll discuss this with others internally and hopefully get back to you soon.

Some information that may help your investigation: my original index has 3 million documents, where the ve_itemno cardinality is 16141.

I wanted to create a job that processes only a small subset of the documents.

Solution attempt 1:

I created an advanced job and selected the documents with a query (first sketch below).

Solution attempt 2:

I created a new, smaller index from the original with the reindex command and then used a multi-metric job on it (second sketch below).
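For attempt 1, the selection query went into the job's datafeed, roughly like this (a minimal sketch - the datafeed name, index name, and term filter are placeholders, not my exact configuration):

# sketch only - placeholder names and filter
PUT _xpack/ml/datafeeds/datafeed-dev2
{
  "job_id": "dev2",
  "indices": ["yourindexname"],
  "query": {
    "term": {
      "ve_itemno.keyword": "example-item"
    }
  }
}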
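For attempt 2, the smaller index was built with the reindex API and a source query, roughly like this (again a sketch - the index names and the date filter are placeholders):

# sketch only - placeholder index names and filter
POST _reindex
{
  "source": {
    "index": "yourindexname",
    "query": {
      "range": {
        "ve_postingdate": {
          "gte": 1420070400000
        }
      }
    }
  },
  "dest": {
    "index": "yourindexname-small"
  }
}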

Both solutions worked on Windows but failed on Linux.

A new small index was also created the normal way (PUT and Logstash), but I received the same hard_limit error message.

We believe the bug here is that the hard limit can be incorrectly triggered too soon when the bucket span is 1 day or longer. We have made a change that should resolve this problem.
