ML memory hits hard_limit - closed

I am running an ML job that keeps hitting the hard_limit memory status and closing.
I'm assuming this has something to do with the variables I've chosen.

I tried to raise the model memory limit from the default (4096) in elasticsearch.yml:
#xpack.ml.max_model_memory_limit: 8192
(I'm not a complete fool :slight_smile: obviously I don't have it commented out)
but I must have something wrong because ES won't restart.

Am I missing something? It must have something to do with the structure of my ML job, but I can't see what.

To change the model memory limit for a job, you have to set the model_memory_limit field of the analysis_limits object in the job config. See https://www.elastic.co/guide/en/elasticsearch/reference/5.6/ml-job-resource.html#ml-apilimits
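
For illustration, the relevant fragment of a job config looks something like this (depending on the version the value is either a plain number of MB or a string with units; 8192mb is only an example, and the other fields are abbreviated):

{
  "analysis_config": { ... },
  "analysis_limits": {
    "model_memory_limit": "8192mb"
  },
  "data_description": { ... }
}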

xpack.ml.max_model_memory_limit is a setting that allows an administrator to cap the maximum value that people creating individual jobs can use for the model_memory_limit field of the analysis_limits object. On its own it doesn't increase the limit used by any job. (Also, if you don't have a requirement for such a cap then it's best not to set it, because the default is 0 which means no cap.)

If you got an error when using xpack.ml.max_model_memory_limit it's probably because the setting was only introduced in version 6.0 and you're running 5.x. Elasticsearch doesn't tolerate unknown settings in config files, and this setting is unknown in 5.x.
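
If you do move to 6.0+ and, per the above, genuinely want a cap rather than a higher per-job limit, the setting would go in elasticsearch.yml along these lines (the value is a size; 8192mb here is just an example):

xpack.ml.max_model_memory_limit: 8192mb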

Hmm, I've had that turned off and I still keep running into this.

The job runs on Event ID 4688 event_data.CommandLine and keeps hitting this 'hard limit'.

Any suggestions?

Please could you post the full job JSON? To get it, expand the row that you pasted the image of, select the "JSON" tab and copy everything in it.

If some names in it are confidential feel free to change them to something more generic, but please try to keep the structure, limits and counts the same as the real job.

Thanks David, I'm still going through all this stuff and working with Windows command-line logs from 4688 to find 'rare' events. Basically, malicious activity on the command line.
In this JSON it has just hit the 'soft limit':
{
  "job_id": "4thwincmd",
  "job_type": "anomaly_detector",
  "job_version": "6.0.0",
  "description": "4th attempt at rare win cmd",
  "create_time": 1511971135730,
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "rare by \"event_data.CommandLine\"",
        "function": "rare",
        "by_field_name": "event_data.CommandLine",
        "detector_rules": [],
        "detector_index": 0
      }
    ],
    "influencers": [
      "computer_name",
      "event_data.CommandLine"
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "results_index_name": "custom-4thwincmd",
  "data_counts": {
    "job_id": "4thwincmd",
    "processed_record_count": 20655561,
    "processed_field_count": 39339059,
    "input_bytes": 4168302624,
    "input_field_count": 39339059,
    "invalid_date_count": 0,
    "missing_field_count": 1972063,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 36656,
    "sparse_bucket_count": 2,
    "bucket_count": 37870,
    "earliest_record_timestamp": 1475600000994,
    "latest_record_timestamp": 1509682819899,
    "last_data_time": 1511972058487,
    "latest_empty_bucket_timestamp": 1508674500000,
    "latest_sparse_bucket_timestamp": 1508682600000,
    "input_record_count": 20655561
  },
  "model_size_stats": {
    "job_id": "4thwincmd",
    "result_type": "model_size_stats",
    "model_bytes": 2854754622,
    "total_by_field_count": 3922953,
    "total_over_field_count": 0,
    "total_partition_field_count": 2,
    "bucket_allocation_failures_count": 0,
    "memory_status": "soft_limit",
    "log_time": 1511972016000,
    "timestamp": 1509678000000
  },
  "datafeed_config": {
    "datafeed_id": "datafeed-4thwincmd",
    "job_id": "4thwincmd",
    "query_delay": "60s",
    "frequency": "450s",
    "indices": [
      "winlogbeat-*"
    ],
    "types": [
      "wineventlog"
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "started",
    "node": {
      "id": "qrSYXO4NRd-uq2rHT9_CwA",
      "name": "qrSYXO4",
      "ephemeral_id": "gsL-4RwXRduWEuBw6BTSYw",
      "transport_address": "IPADDRESS:9300",
      "attributes": {
        "ml.max_open_jobs": "10",
        "ml.enabled": "true"
      }
    }
  },
  "state": "opened",
  "node": {
    "id": "qrSYXO4NRd-uq2rHT9_CwA",
    "name": "qrSYXO4",
    "ephemeral_id": "gsL-4RwXRduWEuBw6BTSYw",
    "transport_address": "IPADDRESS:9300",
    "attributes": {
      "ml.max_open_jobs": "10",
      "ml.enabled": "true"
    }
  },
  "open_time": "922s"
}

Hi Matt,

The reason the job memory is so high is that there are so many (almost 4 million) instances of the by_field; you can see this in total_by_field_count under model_size_stats.

Since your job is doing rare by "event_data.CommandLine", this indicates that the level of uniqueness of the command-lines is really high. I think I remember you saying before that there was a unique ID in each command line invocation. If this is the case, then you'll never be able to find rare command invocations.

Can you confirm this?
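
One quick way to check, assuming event_data.CommandLine is mapped as a keyword (otherwise aggregate on a .keyword sub-field if you have one), is a cardinality aggregation over the same indices the datafeed reads:

GET winlogbeat-*/_search
{
  "size": 0,
  "aggs": {
    "distinct_command_lines": {
      "cardinality": {
        "field": "event_data.CommandLine"
      }
    }
  }
}

The count is approximate, but if it comes back in the millions then almost every command line is unique and rare won't have anything meaningful to model.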

Hey Rich, I was just trying to deal with one error at a time...

I've added some filters to Logstash to remove some of the ones with unique IDs, and there still seem to be a lot in there, so I'm starting to think the 'rare' function may not be the way to go for this.
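
The kind of filter I mean is something along these lines (the field reference assumes winlogbeat's nested event_data object, and the GUID pattern is just one example of an ID to strip):

filter {
  mutate {
    # collapse GUID-style tokens so otherwise-identical command lines map to one value
    gsub => [
      "[event_data][CommandLine]", "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "GUID"
    ]
  }
}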

There has to be a way though, not giving up.
