JVM heap recommendations on k8s cluster

Hi,

I would like to know what the recommendation is for the heap size when operating ES nodes in a k8s cluster. I have referred to this page, but the recommendation is not fully clear to me: Manage compute resources | Elastic Cloud on Kubernetes [2.5] | Elastic

I am aware of the general recommendation of the 50% rule, where Xms/Xmx should be 50% of the available RAM. Elasticsearch will use the non-heap 50% as page cache, which is managed at the kernel level.

However, I am reading that in the case of containers the page cache sits at the host level, as it is maintained by the kernel. So, does that mean containers can use memory in the form of page cache beyond their "limits"? And does that mean the 50% rule is not that meaningful when operating ES as a container?
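For context, this is the kind of setup I mean. A minimal sketch of an ECK manifest where the heap and the container memory are set together (the name and sizes are just examples; recent Elasticsearch versions, 7.11+, can also size the heap automatically from the container limit if ES_JAVA_OPTS is left unset):

kubectl apply -f - <<EOF
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.5.2
  nodeSets:
  - name: default
    count: 1
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms2g -Xmx2g"
          resources:
            requests:
              memory: 4Gi
            limits:
              memory: 4Gi
EOF

Here -Xms/-Xmx is 2g, i.e. 50% of the 4Gi container limit, which is the rule I am asking about.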

I think there's some confusion there. Containers (in a very simplified way) provide process isolation. Quoting Docker security | Docker Docs

processes running within a container cannot see, and even less affect, processes running in another container, or in the host system.

While mmap is a (kernel) system call, the mapping is part of the calling process, and therefore of the container:

mmap() creates a new mapping in the virtual address space of the calling process.

So the 50% rule still applies.
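If you want to verify that, you can look at the address space of the Elasticsearch process inside the container; memory-mapped index files show up as file-backed mappings of that process. A rough sketch (assuming the node holds some index data; replace <es-pid> with the java PID that ps shows inside the container):

grep /usr/share/elasticsearch/data /proc/<es-pid>/maps | head

Every line of that output is a mapping that belongs to this process, not to some shared host-level pool.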

With --ipc set to host you could do some funky stuff around sharing memory-mapped files between containers. But this is not what you're doing here, and it wouldn't give you any benefit.

Thanks @xeraa for the reply.

I am by no means an expert at Docker or Linux.
I landed upon this SO post, which may not necessarily be the source of truth.

This suggests that the page cache is shared by containers. The Docker documentation also says that with the overlay storage driver, the page cache is shared between containers. Use the OverlayFS storage driver | Docker Documentation

So, it seems that even though the mmap mapping is created per container, the mappings point to the same underlying page cache.

In the context of Elasticsearch there would never be a need to share page caches, but what I want to understand is the impact on the 50% rule if the page cache falls outside the realm of containers (and the container memory limits)?

I think we are talking about two different things: whether you need to allocate the memory in the container vs. whether a memory-mapped file can be shared.

Small experiment with a container that has a 512M heap: Elasticsearch wouldn't even start for me with 850M for the entire container. Even with no data, the memory use of this container is on the very high side; docker stats <id> puts me at around 93% memory usage. BTW, setting swap to the same value as memory disables swap, since swapping would only make this more confusing:

Run the container: docker run --memory=950M --memory-swap=950M -e ES_JAVA_OPTS="-Xms512m -Xmx512m" --publish 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.5.2

Then you can check the memory usage with curl -k -u elastic "https://localhost:9200/_nodes/stats/jvm?human" | jq. Relevant part of the output:

  "mem": {
      "heap_used": "212.5mb",
      "heap_used_in_bytes": 222826160,
      "heap_used_percent": 41,
      "heap_committed": "512mb",
      "heap_committed_in_bytes": 536870912,
      "heap_max": "512mb",
      "heap_max_in_bytes": 536870912,
      "non_heap_used": "163.4mb",
      "non_heap_used_in_bytes": 171358840,
      "non_heap_committed": "168.5mb",
      "non_heap_committed_in_bytes": 176750592,

Running this command inside the container, you can also see the memory use (RSS):

sh-5.0$ ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
elastic+   154  0.0  0.0   3752  2944 pts/1    Ss   19:06   0:00 /bin/sh
elastic+   160  0.0  0.0   5480  2320 pts/1    R+   19:06   0:00  \_ ps faux
elastic+     1  0.0  0.0   1936   448 pts/0    Ss   19:03   0:00 /bin/tini -- /usr/local/bin/docker-entrypoint.sh eswrapper
elastic+     7  4.0  2.6 2612888 112200 pts/0  Sl+  19:03   0:06 /usr/share/elasticsearch/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=
elastic+    66 15.0 19.3 3820776 829672 pts/0  Sl+  19:03   0:22  \_ /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.n
elastic+    88  0.0  0.1 103480  6272 pts/0    Sl+  19:03   0:00      \_ /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-aarch64/bin/con

So yes, you will need to allocate that memory within the container.

Hi,

I am referring to the page cache (aka the filesystem cache) and not swap memory. Upon further reading, this is what I think the conclusion is...

  • The 50% rule still applies for Elasticsearch.
  • Page cache is counted against the container even though it is the kernel that manages it (see the sketch after this list).
  • But page cache is also shared across containers if the Docker overlay2 storage driver is used, which I think is the default now. The gotcha is that the page cache is accounted in equal proportions when multiple containers access the same files. Since sharing files is not a scenario for Elasticsearch, we don't need to worry about its impact on the memory calculations.
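If you want to see that accounting for yourself, the cgroup stats expose how much page cache is charged to the container. A minimal sketch, assuming a cgroup v2 host (on cgroup v1 the file is /sys/fs/cgroup/memory/memory.stat and the page cache field is called cache); run it inside the container:

grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat

anon is regular process memory such as the JVM heap, while file is the page cache charged to this cgroup.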

Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge is split between the control groups. It’s nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.

I still lack some clarity on how this memory is reported, but I am convinced that the 50% rule applies 🙂

Thanks.

A data directory can only be used by a single Elasticsearch instance. Try using the same bind-mount in two containers and you'll see the second one fail because the data directory is already locked.
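For example, something along these lines should reproduce it (the container names and the volume are made up for this sketch; the second run should die during startup with an error like "failed to obtain node locks, tried [/usr/share/elasticsearch/data]"):

docker volume create esdata
docker run -d --name es1 -v esdata:/usr/share/elasticsearch/data -e discovery.type=single-node docker.elastic.co/elasticsearch/elasticsearch:8.5.2
docker run --name es2 -v esdata:/usr/share/elasticsearch/data -e discovery.type=single-node docker.elastic.co/elasticsearch/elasticsearch:8.5.2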

So, in the context of Elasticsearch, sharing an mmap'ed file isn't really a thing. And that's why the 50% rule (it's an approximation; in some situations it can even make sense to have less heap than that) applies just like with any other installation method.

Don't get too sidetracked by what mmap can theoretically do if it doesn't make sense in the context of Elasticsearch.
