Elasticsearch master node having high memory pressure

We have our Elasticsearch cluster set up across 3 zones right now. The current setup is:

  • 6 data nodes (2 in each zone) with 58 GB RAM each
  • 3 master nodes (1 in each zone) with 8 GB RAM each

We are noticing that one master node consistently has high memory pressure (~65%) while the other two are idle. We want to change the master configuration as below so that the elected master node has more memory.

  • 2 master nodes (1 in each of two zones) with 15 GB RAM each

Please confirm whether there are any concerns with moving to this configuration.

For high availability you need 3 master-eligible nodes; 2 is not good, because electing a master requires a strict majority, and a majority of 2 is 2, so losing either node would leave the cluster without a master. As any of them can become master at any time, they all need to be the same size. As dedicated master nodes do not hold data and should not serve requests, you can often set the heap size to a larger percentage of total RAM, e.g. 75%.
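To make the "2 is not good" point concrete, here is a minimal sketch of the majority arithmetic (plain Python, not an Elasticsearch API; the node counts are just the ones discussed above):

```python
# Minimal sketch: Elasticsearch needs a strict majority (quorum) of
# master-eligible nodes to elect a master. This is plain arithmetic,
# not an Elasticsearch API.

def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1

for n in (2, 3):
    q = quorum(n)
    print(f"{n} master-eligible nodes: quorum = {q}, "
          f"tolerates {n - q} node failure(s)")

# 2 master-eligible nodes: quorum = 2, tolerates 0 node failure(s)
# 3 master-eligible nodes: quorum = 2, tolerates 1 node failure(s)
```

Note that 3 master-eligible nodes have the same quorum as 2 but can survive the loss of one node, which is why 2 dedicated masters buy you nothing over 3 in terms of resilience.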


I wouldn't recommend that. Master nodes sometimes need a bunch of direct memory in addition to their heap. There's a pretty serious risk of getting killed by the OOM killer if you set your heap size to exceed 50% of your total RAM.

I am a bit surprised by this, as the general guidance, as far as I know, has always been that dedicated node types not holding data (master, ingest, and coordinating-only nodes, possibly also dedicated ML nodes) do not need to set aside 50% of RAM for the operating system page cache.

If a dedicated master node also needs to be limited to a heap of 50% of RAM due to the amount of off-heap storage, what does this mean for the viability of small master-data nodes, given that most of the 50% of RAM not used by the heap might not be available for the OS page cache to boost performance? Do you have any figures on how much off-heap memory a dedicated master node uses and how this varies with e.g. cluster size?

This is correct, but it does not change the fact that the heap size must be limited to no more than 50% of the available RAM. I think there is some confusion here.

At a high level there are three main components to Elasticsearch's memory usage: the heap, direct memory, and the filesystem cache (also known as the page cache). Heap and direct memory are attributed to the Elasticsearch process; if they add up to more than the available RAM then the process is liable to be killed. The filesystem cache is not attributed to the process, since it can all be discarded if needed (with a performance penalty, obviously).

Non-data nodes do not really need any filesystem cache, as you say, but they still need direct memory for networking. Heap size is fixed at startup, but direct memory grows and shrinks as needed. In older versions the direct memory size is limited to be no larger than the heap, and in newer versions it has a slightly more conservative limit. Thus you need at least twice as much RAM as your heap size (plus overhead in older versions) just for the process, and anything left over is available for the filesystem cache.
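To make that arithmetic concrete, here is a minimal sketch. It assumes the direct-memory limit equals the heap on older versions (the JVM default) and half the heap on newer ones, which is one reading of the "more conservative limit" above; the RAM figure is the 8 GB dedicated masters from this thread.

```python
# A minimal sketch of the memory budget described above. Assumptions:
# the direct-memory limit equals the heap on older versions (the JVM
# default) and half the heap on newer ones -- one reading of the "more
# conservative limit" mentioned above. JVM overhead is ignored.

def worst_case_process_gb(heap_gb: float, newer: bool) -> float:
    direct_limit_gb = heap_gb / 2 if newer else heap_gb
    return heap_gb + direct_limit_gb  # heap + direct memory

ram_gb = 8.0  # the dedicated masters in this thread
for heap_gb in (4.0, 6.0):  # 50% of RAM vs the suggested 75%
    for newer in (False, True):
        process = worst_case_process_gb(heap_gb, newer)
        verdict = "fits" if process <= ram_gb else "liable to be OOM-killed"
        print(f"heap {heap_gb:.0f} GB, newer={newer}: "
              f"worst case {process:.0f} GB of {ram_gb:.0f} GB -> {verdict}")
```

With a 6 GB (75%) heap, the worst case exceeds the 8 GB of RAM under either limit, which is exactly the OOM-killer risk described above; a 4 GB (50%) heap stays within bounds.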

The docs were recently adjusted to clarify this:

Elasticsearch requires memory for purposes other than the JVM heap and it is important to leave space for this. For instance, Elasticsearch uses off-heap buffers for efficient network communication, relies on the operating system’s filesystem cache for efficient access to files, and the JVM itself requires some memory too.

Here "off-heap buffers" is referring to direct memory, which is distinct from the filesystem cache.

It depends (of course). Since direct memory grows and shrinks as needed, it's possible that most of the spare 50% is used by the filesystem cache, but it's also possible that it's all taken up by networking. I know there have been changes within the 7.x series that affect the profile of direct memory usage but I've not been following the details.

No, I don't have any hard figures for this, except that it's certainly no more than the heap size. I would expect it to be quite spiky - I believe the most expensive time for a master node is when a lot of nodes are all concurrently joining the cluster, e.g. after a network partition heals, since the master must send the full cluster state to each joining node. In normal running I would expect it to be much lower.
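As a purely illustrative back-of-the-envelope for that spike (the cluster-state size and node count below are invented, not measured figures):

```python
# Invented numbers, purely for illustration: if the master sends the
# full cluster state to each concurrently joining node, and each copy
# is buffered in direct memory at once (a worst-case assumption), the
# spike scales with both figures.

cluster_state_mb = 100  # hypothetical serialized cluster-state size
joining_nodes = 20      # hypothetical nodes rejoining after a partition heals

spike_gb = cluster_state_mb * joining_nodes / 1024
print(f"worst-case join spike: ~{spike_gb:.1f} GB of direct memory")
# ~2.0 GB during the join storm; far lower in normal running
```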

