A couple of the ML jobs running in our environment have hit their "hard limit". I increased "model_memory_limit" to give them more resources, but they continued to hit the hard limit. When I look at the "counts" tab for the specific job in the ML interface, I see "model_bytes_memory_limit" still set to a number that is too low for the model to work. I haven't been able to find a way to change that field, or even find enough information about it to tell whether it is something I should be touching.
In your other post, I explained how to roughly estimate the required memory (using the ~30kB per entity value). You'll need to figure out the rough cardinality of the data set you expect to analyze and adjust accordingly. Otherwise, you can just keep trying increased values until the job stops throwing HARD_LIMIT errors.
Thanks for the response. I finally understand how cardinality affects resource utilization when it comes to a single partition job. In the job that we were discussing in the previous post, which is split twice, how does that second split affect resources? To make things easy, say the cardinality of each split is 1000. Does the rough math look like 30kB * 1000 + 30kB * 1000 to take each of the splits into account, or is it different? Thanks.
Not quite that easy. First of all, the memory requirements for a partition field are a little more than for a by field. Partitions are like 25kB-30kB and by fields are like 10kB-15kB. Secondly, in a double split scenario, you have to work out the combinatorics. So, for example, let's say you are analyzing a field called host.name and one called process.name. Let's say you have 50 unique hosts and 100 unique process names (total).
If you did count by process partition=host, then you'll certainly get 50 partitions, but you may not know whether all 100 processes exist on every host. If that IS the case, then you'll have 5000 (50*100) unique combinations of things that ML has to track (host1:process1, host1:process2, ... host50:process100), so there will be 5000 items in this list.
On the other hand, let's assume the extreme opposite: 49 hosts each have only 1 process running, and 1 host has 100 processes. There is now a total of only 149 unique combinations (49+100).
Your memory requirement should therefore be on the order of 20kB times the number of unique combinations.
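To make the arithmetic for the two scenarios above concrete, here is a minimal sketch in Python. The function name and the list-of-counts input are just illustrative choices, and the 20kB figure is the rough per-combination value from this post, not a measured number, so treat the result as an order-of-magnitude guess.

```python
# Rough per-combination cost taken from the post above (an assumption, not a measured value).
KB_PER_COMBINATION = 20

def estimate_model_memory_kb(processes_per_host, kb_per_combination=KB_PER_COMBINATION):
    """Return (unique host:process combinations, rough memory estimate in kB).

    processes_per_host is a hypothetical input: one entry per host, giving the
    number of distinct process names seen on that host.
    """
    combinations = sum(processes_per_host)
    return combinations, combinations * kb_per_combination

# Scenario 1: all 100 processes exist on each of the 50 hosts.
dense = [100] * 50
# Scenario 2: 49 hosts run a single process, 1 host runs 100.
sparse = [1] * 49 + [100]

dense_combos, dense_kb = estimate_model_memory_kb(dense)
sparse_combos, sparse_kb = estimate_model_memory_kb(sparse)

print(dense_combos, dense_kb)    # 5000 100000  (roughly 100MB)
print(sparse_combos, sparse_kb)  # 149 2980     (roughly 3MB)
```

The same 50 hosts and 100 process names give estimates that differ by more than 30x, which is why the cardinality of each field on its own is not enough to size the job.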
Thanks for getting back to me so quickly. I appreciate that explanation; things are becoming a little less hazy. The following screen shots are from a job that has been grinding my gears. I am going to try to explain what is going on here, and if you don't mind illuminating where I am going off the path, I would appreciate it. This job follows similar logic to the hypothetical job you described above.
As you can see from the first screen shot this is rare by "process.name" partition_field_name="host.name" so the fields of interest are process.name and host.name. Based on the second screen shot we have 397,973 processes each of which is responsible for 10-15kb as the process.name is the by_field.
Then, we have 4,557 host.names which is the partition_field each of which is holding 25-30kb.
So, the information that we don't have easy access to is how many of those processes are happening on each of the hosts. I am going to try to replicate the logic that you were trying to get across above that describes what is happening: (process1:host1(if the process is present on this host), process1:host2, xxxx, process397,973:host4,557).
The above is really ugly, but it seems like the by_field alone, at a conservative 10kB per entity, would require about 4GB of memory. Determining exactly how many of these processes are on each host hurts my brain. I feel like I am missing something here as well, because, if my understanding isn't flawed, the logic underlying this job seems to cause a pretty inefficient mathematical situation.
As an intermediate question: does the "model" (I don't know if that is the right term to use here) run through every possible permutation based on the way this job is laid out? It seems like the by_field being processed would require it to at least check whether one of the 5000 hosts is running that process, before moving to the next process, where it would again have to check whether the process is running on that host and, if it is, actually make a model for that entity.
I am just going to leave it at that for now, as this is becoming quite verbose....
Since the job sees data in chronological order, you can assume that when the ML job starts, it has no knowledge of how many hosts and how many processes (and thus how many processes per host) to expect (Note: for now ignore the validation step of the ML job being created in the ML UI which actually tries to make some guesses via quick background queries/aggregations).
So, when the ML job starts seeing data, it says - "oh, here's a record for a process named processX on hostname hostY - do I have a hostY partition yet? If so, add processX to the list of processes for hostY. If not, then add the hostY partition and add processX to the list of processes for hostY". Let's say that in the first 15 minutes of inspecting data, the ML job has seen 37 hosts and 946 process/host combinations - it is still a long way from the end state, where there are actually like 5000+ hosts and 400,000+ process/host combinations. But, as time goes on, the model will grow as it encounters more of those combinations.
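The bookkeeping described above can be sketched in a few lines of Python. This is only a toy illustration of the "add the partition if it is new, then add the process to it" logic, not how the ML job is actually implemented; the function name, the record format, and the sample host and process names are all made up for the example.

```python
def track_combinations(records):
    """Consume (host, process) records in order and track model growth.

    Returns the partitions seen (host -> set of process names) and a list
    giving the running count of unique host:process combinations after
    each record.
    """
    partitions = {}
    total = 0
    growth = []
    for host, process in records:
        # Add the host partition if this is the first time we see the host.
        processes = partitions.setdefault(host, set())
        # Only a previously unseen host:process combination grows the model.
        if process not in processes:
            processes.add(process)
            total += 1
        growth.append(total)
    return partitions, growth

# Hypothetical records in chronological order.
records = [
    ("hostA", "sshd"),
    ("hostA", "cron"),
    ("hostB", "sshd"),
    ("hostA", "sshd"),   # already known, so no growth
    ("hostC", "nginx"),
]

partitions, growth = track_combinations(records)
print(len(partitions), growth)  # 3 [1, 2, 3, 3, 4]
```

Nothing is enumerated up front: the count only goes up when a combination actually shows up in the data, so the total keeps climbing until every combination that really occurs has been seen.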
Therefore, setting a model_bytes_memory limit sort of requires you to think about the end state / worst-case scenario.