Resource Utilization Machine Learning

Hello,

I am working on the Machine Learning capability within two of the clusters that I am responsible for. I am trying to figure out the relationship between the size of my nodes, their configuration, and the model_memory_limit of each of the jobs that I am currently running, to get a better understanding of how many jobs I can run and what size they need to be.

As a quick side note, we have two dedicated ML nodes. Both have 16 CPUs and 64GB of RAM, which I believe should be more than enough, but I am hoping to learn whether that theory is solidly founded or not.

We are ingesting a large amount of data daily that the datafeeds have to go through (around 1TB). Since most of it is Windows information, those are the types of jobs that we are focusing on. I have to play with the model_memory_limit of each job as they blow through the default settings immediately.

For starters, is there an equation for determining how much memory you can give each job while ensuring that the ML nodes are sufficiently powered to execute the jobs? I have seen one rule of 10 jobs per node, or, if you are concerned about HA, 10 jobs per node for all but one node. For instance, if you have 4 nodes you would run a max of 30 jobs. I just don't know how big each of those jobs could be.

I also have some questions about the proper configuration of the ML nodes, but I will leave that for later depending on the level of response I get here.

Thanks for your time!
Alex

The 10 jobs per node maximum was in very early releases of ML within the Elastic stack - 5.x I think. I assume you're on a relatively recent version now, so don't worry about that. The xpack.ml.max_open_jobs setting now defaults to 512. If you have set it back to 10 then I suggest increasing the value.

In reality the limiting factor for jobs on a node is available memory. On ML nodes we recommend setting the JVM heap size quite low so that most memory is available for the ML jobs, which run in separate processes outside of the JVM.

On a 64GB node, maybe try setting the JVM heap to 6GB, then set xpack.ml.max_machine_memory_percent to 90. That will then allow 57.6GB to be used for ML jobs, and 0.4GB left over for the OS. (I'm assuming there's no other major software running on these machines, which is best practice with Elasticsearch.)
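As a rough sketch of that arithmetic (the function name here is made up for illustration, and it assumes xpack.ml.max_machine_memory_percent is applied to total machine RAM, with the JVM heap and the OS sharing whatever is left):

```python
def ml_memory_budget_gb(node_ram_gb, jvm_heap_gb, max_machine_memory_percent):
    """Split a node's RAM into the ML share and what is left for the OS.

    Assumption: the ML share is a percentage of total machine RAM, and the
    JVM heap plus the OS have to fit into the remainder.
    """
    ml_gb = node_ram_gb * max_machine_memory_percent / 100
    os_gb = node_ram_gb - jvm_heap_gb - ml_gb
    return ml_gb, os_gb


# Figures from the post: 64GB node, 6GB JVM heap, 90 percent for ML.
ml_gb, os_gb = ml_memory_budget_gb(64, 6, 90)
print(f"ML jobs: {ml_gb:.1f}GB, left for OS: {os_gb:.1f}GB")
```

With the figures above this gives 57.6GB for ML jobs and 0.4GB for the OS. Trying a few different heap sizes and percentages is a quick way to see how tight the OS headroom gets.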

Then the number of jobs is effectively limited by the available ML memory. The 512 absolute maximum is still there, but, as you found, unlikely to ever be hit as non-trivial jobs need a lot of memory each.

So instead of thinking of numbers of jobs, think of the sum of their model_memory_limit values. There's also an extra overhead of 10MB per job to cover memory outside of the modelling, and a 30MB one-off per node for loading the ML code. These numbers used to be higher in older versions - we reduced them as we were being overcautious before and it severely limited users with small nodes.

Like you say, to tolerate a node failure and still run all jobs you'll need excess capacity. Effectively you need one node over and above the minimum number of nodes that the sum of your model_memory_limit values plus overheads could fit into.
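To make the sizing rule concrete, here is a minimal sketch. It is an approximation only: it divides total memory by per-node capacity and ignores the fact that a single job cannot be split across nodes, and the function names and the example workload (15 jobs at 2GB each) are purely illustrative.

```python
import math

PER_JOB_OVERHEAD_MB = 10   # extra memory per job outside the model
PER_NODE_OVERHEAD_MB = 30  # one-off cost per node for loading the ML code


def required_memory_mb(model_memory_limits_mb):
    """Sum of model_memory_limit values plus the per-job overhead."""
    return sum(limit + PER_JOB_OVERHEAD_MB for limit in model_memory_limits_mb)


def nodes_needed(model_memory_limits_mb, ml_memory_per_node_mb, spare_nodes=1):
    """Minimum node count by total memory, plus spare nodes for failover.

    Simplification: treats memory as one pool, so it can undercount when
    individual jobs are large relative to a node.
    """
    usable_mb = ml_memory_per_node_mb - PER_NODE_OVERHEAD_MB
    minimum = math.ceil(required_memory_mb(model_memory_limits_mb) / usable_mb)
    return minimum + spare_nodes


# Illustrative workload: 15 jobs at 2GB each, on nodes with 57.6GB for ML.
limits = [2048] * 15
print(required_memory_mb(limits))          # total MB including overheads
print(nodes_needed(limits, 57.6 * 1024))   # nodes including one spare
```

In this example everything fits on a single node, so two nodes would be enough to survive the loss of one.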

David,

Thanks so much for the quick response and the valuable information! There is only one point that I am a little hazy on. For HA, do I want the model_memory_limit values to add up to a figure slightly lower than 57.6GB, to be exact?

If I understood that correctly, have you found a good way of ballparking how much memory to give each job access to? Right now, for whatever reason, when I am spinning up a new job Elastic is not able to estimate a memory size for the individual job. To date, we have just been adding memory to the jobs until they stop hitting the soft or hard limits. Right now I think most of our jobs are at a 2GB model_memory_limit.

I have gone through the running ML at scale document a few times and I think that there are a number of things that we can do to make these jobs a little less resource intensive. I guess the first question that comes to mind is: given the limited visibility you have into our operating environment, does 2GB seem like an appropriate size? We are on Elastic 7.16.2 and are ingesting about 600GB/day of winlogbeat traffic and roughly 300GB/day of network traffic. We have 15 open jobs at the moment, and almost all of them have datafeeds that are querying the winlogbeat index.

Is right-sizing the model_memory_limit an iterative, unscientific process of trial and error, or is there something that we are not doing right?

Thanks again for your help,
Alex

The model memory requirements are primarily driven by the number of "splits" (partition_fields and/or by_fields - defined as "entities" here) within a job - because each unique entity being modeled has to have its own individual model. As a rough ballpark, this is about 30kB RAM (sometimes less) per entity. So, if you expect your job to model 1000 entities - you will have about a 30MB model size in RAM. If it were like 300k entities you'd be up near 9GB. If you have high cardinality data sets, it is often easier (or more appropriate) to not model individual entities, but rather switch to population analysis. This will lessen the burden on trying to model and store individual entities when it may be unnecessary to do so.

So, it really isn't about the "volume" of the input data, per se; it is more about the cardinality of the analysis.
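The per-entity rule of thumb above can be turned into a quick back-of-the-envelope estimate. This is only a sketch: the 30kB figure is the rough ballpark from this post (real models are sometimes smaller), the function name is made up, and decimal units (1MB = 1000kB) are used to match the numbers quoted.

```python
KB_PER_ENTITY = 30  # rough ballpark per modeled entity, sometimes less


def estimate_model_memory_mb(entity_count, kb_per_entity=KB_PER_ENTITY):
    """Rough model size in MB for a given number of modeled entities."""
    return entity_count * kb_per_entity / 1000


for entities in (1_000, 300_000):
    print(f"{entities} entities: about {estimate_model_memory_mb(entities):.0f}MB")
```

For 1000 entities this gives about 30MB, and for 300k entities about 9000MB (9GB), matching the figures above.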

Compare the model sizes for two jobs that have 1000 entities (clientip). The top one is partitioned on clientip whereas the bottom one is done as a population job:

Rich,

Thanks for getting back to me. Your post is making me realize I have a lot of studying to do, but I am grateful that I at least have some starting points now. I was wondering if you could expand on:

Specifically, do you add partition_fields and by_fields together in order to get the total number of entities? Looking at the jobs we have running we have over 7000 partition_fields and our by_fields are in the millions.

At this point I think we do have high cardinality data sets. We are still trying to figure out what we are doing and are only using the out of the box ML jobs that power some of the detections within the SIEM (all Windows at the moment). If that is true, it seems like switching to population analysis is a good way to reduce resource requirements, but I have to research whether or not there are drawbacks to using population analysis.

The other question I have is related to the following statement:

Probably a really basic question, but what does "partitioned on" mean? I have been looking through our jobs in order to tease it out on my own (which I am adding to my research list) and have not been able to figure it out yet. Though it seems like "client_ip" is basically the field that the model is being built on, maybe?

The following is the same screen shot of one of the jobs that we have running and hopefully that will be helpful.

Thanks,
Alex

"partitioned on" means split the analysis "for every". There are two ways to split - with a partition field and with a by field (see below).

Ok cool - so if you're running the rare process per host ML job (that is a built-in detection rule) then the job configuration shows the following:

    "detectors": [
      {
        "detector_description": "rare process executions on Windows",
        "function": "rare",
        "by_field_name": "process.name",
        "partition_field_name": "host.name"
      }
    ],

This basically means "find a rare process... consider every process for every host". The by_field is process.name and the partition_field is host.name. From your screenshot, it looks as if you have about 7300 hosts and over 1.6 million (total) process names being tracked by this ML job. So, this is a "double split" because both the by_field and partition_field are being used (they have subtle differences in "how" they cause a split, and if you crafted an ML job by hand you might want to work out the similarities and differences between the two).
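To illustrate what the "double split" counts, here is a small sketch that tallies distinct partition values and distinct (partition, by) combinations from a handful of events. The sample events and the function name are hypothetical, and treating each combination as one tracked entity is a simplification of how the job actually models the data.

```python
from collections import defaultdict


def split_cardinality(events, partition_field="host.name", by_field="process.name"):
    """Return (number of partitions, total by-values summed over partitions)."""
    by_values_per_partition = defaultdict(set)
    for event in events:
        by_values_per_partition[event[partition_field]].add(event[by_field])
    partitions = len(by_values_per_partition)
    combinations = sum(len(values) for values in by_values_per_partition.values())
    return partitions, combinations


# Hypothetical sample events, flattened to dotted field names.
events = [
    {"host.name": "host-a", "process.name": "svchost.exe"},
    {"host.name": "host-a", "process.name": "cmd.exe"},
    {"host.name": "host-a", "process.name": "svchost.exe"},  # repeat, not counted twice
    {"host.name": "host-b", "process.name": "svchost.exe"},
]
print(split_cardinality(events))
```

The same process name on two different hosts counts twice, which is why the total number of tracked process names can run far beyond the number of distinct process names in the data.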

But in your case, you've simply enabled a built-in job, so you didn't really have much say as to how the job is constructed; you are just seeing the result of deploying the job to cover 7300+ entities (hosts in this case), each host having many processes.

So, you have a couple of options here. If you feel like this job is valuable and if it is working fine on a particular ML node... then let it go. A model memory size of 1.6GB is not that obscene. You could, in theory, create several cloned versions of this job, and have each one operate on a filtered list of hosts (e.g. jobA is for hosts in the LA Data Center where the hostname begins with "LAXPROD..." and jobB is for the St. Louis Data Center where the hostname begins with "STLPROD...", and so on). In general, having more smaller jobs is better than having fewer, giant jobs merely because the jobs can be more easily distributed to run on more nodes. But you probably don't need to go to such lengths here.
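If you did want to go the cloned-job route, the per-clone filter could be expressed as a query in each clone's datafeed. The sketch below only builds the query bodies as Python dicts; the base job ID and helper names are hypothetical, and it assumes host.name is a keyword field so that a case-sensitive prefix query matches the hostnames.

```python
def datafeed_query_for_prefix(prefix, field="host.name"):
    """Query DSL body that keeps only documents whose field starts with prefix."""
    return {"bool": {"filter": [{"prefix": {field: {"value": prefix}}}]}}


def plan_cloned_jobs(base_job_id, prefixes):
    """Map a hypothetical clone job ID to the query its datafeed would use."""
    return {
        f"{base_job_id}-{prefix.lower()}": datafeed_query_for_prefix(prefix)
        for prefix in prefixes
    }


# Prefixes taken from the data center example above; the base ID is made up.
plan = plan_cloned_jobs("rare-process-by-host", ["LAXPROD", "STLPROD"])
for job_id, query in plan.items():
    print(job_id, query)
```

Each clone would then model only its own subset of hosts, so the memory of the original job is spread across several smaller jobs.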

If you want to learn more about Elastic ML may I suggest reading my book. Hard copy on Amazon or get a free e-copy here.

Thank you! Things are starting to come together for me a little bit more. I have been hoping to find something like your book. Unfortunately, I am having trouble downloading it at work (stupid proxy...) but I will definitely grab it once I get home.

Now that I understand some of the basic principles a little better moving ahead is going to be a little less daunting. What is the best way of defining the cardinality of an entity as "too high"? In this scenario it seems like the data is being split twice. The cardinality of process.name is about 1700 and host.name is about 4700. That seems like it would be more resource intensive on the system than the analysis only being split once (obviously) but, how much of a hit is it?

I agree with you that splitting the job up seems unnecessary at this point, as we have enough resources to deal with these jobs at the moment. This discussion has really helped me understand what our cluster is capable of. Thanks!

I just finished your book the other day. Thanks so much for sharing it. Included in it was the background info that I was hoping to find to understand ML better as well as providing me with a better understanding of how it works within Elastic specifically.

Between your responses here and the book I feel like I have gained knowledge and confidence. That being said, I am going to open another topic. The customer that I work with is using Elastic as a SIEM and now the analysts spend most of their time using Endgame as they are comfortable with it and have had a lot of success with it.

I am trying to use the Security application and ML to augment the things that they are already doing successfully in Endgame. The environment isn't huge, but it is definitely large (roughly 4500 endpoints sending host data along with a fair bit of network traffic as well). I have spent a lot of time within the Security app and feel like I am duplicating Endgame functionality (and poorly).

Now I am trying to learn more about the ML app in hopes of bringing some stuff to the table that adds value to the current solution in a way that the analysts actually bring it into their workflow in a meaningful way. I have been working with the out of the box jobs as a starting point, and am trying to figure out if I will actually be able to make them valuable to the analysts or if I am going to have to build jobs from the ground up. Thanks again.
