Enhancement: make ECE system containers memory- and CPU-contention/limit aware

Hello Elastic people,

Today, ECE assumes it has the entire host to itself. For cloud and VM deployments that might be fine, but for on-prem and future systems with a lot of resources it might not be.

The trend is toward more cores per CPU (see the upcoming AMD Genoa and Bergamo with up to 128 cores per CPU), so hosts (allocators) will have far more compute resources in a single box, roughly double compared to the previous generation. On Elastic ingest nodes (where most of the parsing happens anyway), it would make sense to have a custom option to influence the calculated CPU quota for ingest instances. There are already some requests for this, not only for ingest instances but also for the ECE system containers.

Some time ago (2021), I requested that more or less the same memory and CPU contention mechanisms (cgroups via Docker) already used for the Elastic Stack be applied to the ECE system containers (frc-runners…, etc.), which are unaware of what limits are imposed on them.

ECE system containers do not read the Docker cgroup limits; those limits constrain the container, but /proc/cpuinfo, /proc/meminfo, and /proc/swaps still report the host's resources. Ideally, the runner app inside the container should not use /proc/meminfo at all (it is not cgroup-aware) and should instead rely on the Docker cgroup limit at /sys/fs/cgroup/memory/memory.limit_in_bytes.

With so many compute resources at our disposal in a single system (e.g. dual Bergamo with 2x128 cores and 12 TB of RAM, or more via CXL), we might run multiple Filebeat and Logstash instances on the same host, but we would like to avoid them competing for resources with the ECE system containers.

Since there is no public GitHub repo for ECE, we need to discuss it here.

@kimchy

Would that fit as an innovation?

On the same enhancement topic:

"What are the benefits of cgroup v2 that you think would be useful here?"
Please see the following article which nicely tracks all advances.

focuses on simplicity
Friendly to rootless containers - meaning --cpus=2 --cpu-shares=2000 which ECE should use
eBPF-oriented
which takes me not only for device access control but to a better network control

For a more detailed discussion of modern kernel features (cpu.pressure, memory.pressure, and io.pressure), see the thinking behind cgroups v2 here:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#issues-with-v1-and-rationales-for-v2
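For illustration, all three v2 pressure files share one simple text format that a monitoring agent could consume directly; a hypothetical parser (the format follows the kernel PSI documentation, the function name is my own):

```python
def parse_pressure(text):
    """Parse a cgroup v2 PSI file (cpu.pressure, memory.pressure, io.pressure).

    Each line looks like:
        some avg10=0.12 avg60=0.08 avg300=0.01 total=123456
    where the avg* fields are the percentage of time tasks were stalled
    on the resource and total is cumulative stall time in microseconds.
    """
    parsed = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        parsed[kind] = {key: float(value)
                        for key, value in (f.split("=") for f in fields)}
    return parsed
```

ECE system containers could use these numbers to detect resource contention directly instead of inferring it from host-wide /proc statistics.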

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.