Understanding "Maximum memory permitted for ML native processes"

At /app/ml/memory_usage I see a "Maximum memory permitted for ML native processes" marker sitting at 200MB.

Hovering anywhere else apart from that marker shows the full breakdown.

Why is the "Maximum memory permitted for ML native processes" pegged at such a low value that, even if things like anomaly detection jobs added up to the full 200MB, a further 200MB of "estimated available memory" would still be left unused?

Or to put it another way: in the state above, why would only a further 95.5MB be allowed for ML, rather than closer to 295.5MB?

I can see that using it all up would leave no headroom for the JVM heap to grow, but is it hard-coded somewhere to keep 200MB free or is there another explanation?

On such a small node (1GB RAM, though for some reason it shows up as only having 664MB) it seems a bit wasteful to leave a whole 200MB unusable, the same amount that is allowed to be used.

The reason for this is that when you enable log and metrics collection to a monitoring cluster, Filebeat and Metricbeat are run, and they each use 180MB of the 1024MB total.
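
As a rough back-of-the-envelope sketch (the 180MB-per-beat figures come from the explanation above; the exact accounting inside the orchestrator may differ), that also explains the 664MB the node reports:

```python
# Back-of-the-envelope sketch of where the 1GB goes once monitoring is enabled.
# The per-beat figures are taken from the explanation above; the exact
# accounting may differ slightly.

node_total_mb = 1024          # 1GB node
filebeat_mb = 180             # log shipping to the monitoring cluster
metricbeat_mb = 180           # metrics shipping to the monitoring cluster

visible_to_elasticsearch_mb = node_total_mb - filebeat_mb - metricbeat_mb
print(visible_to_elasticsearch_mb)  # 664 -- the "664MB" the node reports
```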

The JVM heap will never grow: the Elasticsearch startup script sets Xmx and Xms to the same value. But yes, you're right that it's hard-coded somewhere to keep 200MB free on dedicated ML nodes over and above what the JVM heap uses. The Java process can use native memory outside of the JVM heap, for example for direct buffers used in networking. There's also a small process called controller that runs on all Elasticsearch nodes. 200MB might seem like a lot to reserve, but the effect of a node running out of memory and the Linux kernel deciding to kill off processes as a result is catastrophic, so we take measures to avoid this.
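
To illustrate how the limit ends up at roughly 200MB on a node this size, here is a sketch of the arithmetic. The 200MB reservation is the hard-coded headroom described above; the JVM heap size used here is an assumed value chosen only to match the numbers in this thread, not a figure taken from the actual node configuration.

```python
# Illustrative sketch of why "Maximum memory permitted for ML native processes"
# comes out near 200MB on this node. The heap size is an ASSUMED value for
# illustration; the 200MB reserve is the hard-coded headroom described above.

node_memory_mb = 664          # what the node reports after the beats' share is removed
jvm_heap_mb = 264             # assumption: Xms == Xmx, so the heap never grows
reserved_mb = 200             # fixed headroom for native memory, direct buffers, controller

max_ml_native_mb = node_memory_mb - jvm_heap_mb - reserved_mb
print(max_ml_native_mb)       # ~200MB permitted for ML native processes
```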

The biggest issue here is really the 360MB consumed by Filebeat and Metricbeat because you've enabled log and metrics collection. Perhaps we should disallow 1GB nodes when this option is selected, because, as you've found, on such a small node there isn't much space left over for anything else.

I see, that all makes sense, thanks so much for explaining. I fully appreciate that 1GB is not a lot to work with, so this is fine.

It's still sufficient for the handful of anomaly detection jobs we run, though, so it's definitely still worthwhile having the 1GB node; I was just trying to understand things.

Thanks!
