Mmapfs and impact of vm.max_map_count?

This with reference to

What are the pros and cons of increasing vm.max_map_count from 64k to 256k?

Does a vm.max_map_count of 64k imply 64k addresses * 64 KB page size = up to 4 GB of Lucene index data that ES can reference while it actually resides in the FS cache?

And if I exceed 4 GB, the addressable space under the max_map_count limit, will the OS need to page out some of the older accessed index data?

Maybe my understanding above is not correct, as the FS cache can use the remaining memory of, say, 16 GB to store Lucene data. So I am not sure what exactly the vm.max_map_count limit does.

How does this limit result in OOM?

I also read which had good explanation but still could not fully understand the full role and impact of increasing vm.max_map_count

Answering my own question based on further digging and a reply from Uwe Schindler (Lucene PMC):

The page size has nothing to do with the max_map_count. It is the number of mappings that are allocated. Lucene's MMapDirectory maps in portions of up to 1 GiB. The number of mappings is therefore dependent on the number of segments (number of files in the index directory) and their size. A typical index with about 40 files in the index directory, all of them smaller than 1 GiB, needs 40 mappings. If the index is larger, has 40 files and most segments are around 20 gigabytes, then it could take up to 800 mappings.
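The arithmetic above can be sketched roughly as follows. This is only an estimate, assuming every file is mapped in chunks of at most 1 GiB as described; the function name and the sample file sizes are made up for illustration and are not part of Lucene.

```python
GIB = 1024 ** 3  # assumed chunk size: up to 1 GiB per mapping, as described above


def estimate_mappings(file_sizes_bytes, chunk_bytes=GIB):
    """Estimate the mappings needed if each file is mmapped in chunks of at most chunk_bytes."""
    # Count at least one mapping per file; larger files need one per chunk
    # (integer ceiling division avoids floating-point rounding).
    return sum(max(1, -(-size // chunk_bytes)) for size in file_sizes_bytes)


# 40 files, each smaller than 1 GiB: one mapping per file.
small_index = [500 * 1024 ** 2] * 40
# 40 files of 20 GiB each: 20 mappings per file.
large_index = [20 * GIB] * 40

print(estimate_mappings(small_index))  # 40
print(estimate_mappings(large_index))  # 800
```

Either way, both numbers are far below the default limit of roughly 64k, which is the point of the explanation.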

The reason why the Elasticsearch people recommend raising max_map_count is their customer structure. Most Logstash users have Elasticsearch clouds with something like 10,000 indexes, each possibly very large, so the number of mappings could become a limiting factor.

I'd suggest not changing the default setting unless you get IOExceptions about "map failed". (Please note: it will not result in OOMs with recent Lucene versions, as this is handled internally!)
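Before changing anything, it may help to look at the current limit and how far a process is from it. A minimal sketch, assuming a Linux host where the limit is exposed at /proc/sys/vm/max_map_count; the helper names are hypothetical and the sample numbers are only illustrative.

```python
from pathlib import Path


def parse_max_map_count(text):
    """Parse the contents of /proc/sys/vm/max_map_count (a single integer)."""
    return int(text.strip())


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    # Linux-specific path; on other systems this file does not exist.
    return parse_max_map_count(Path(path).read_text())


def usage_ratio(mappings_in_use, limit):
    """Fraction of the limit consumed by the mappings a process currently holds."""
    return mappings_in_use / limit


print(parse_max_map_count("65530\n"))    # 65530
print(f"{usage_ratio(594, 65530):.2%}")  # 0.91%
```

On a real node, `read_max_map_count()` would supply the limit and the number of lines in /proc/<pid>/maps would supply the mappings in use.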

The paging of the OS has nothing to do with the mapped file count. The max_map_count is just a limit on how many mappings in total can be used. A mapping needs one chunk of up to 1 GiB that is mmapped. Paging in the OS happens at a much lower level: it swaps any part of those chunks independently, according to the page size. chunk != page size

Summary (please correct me if I am wrong): unlike what the documentation suggests, I don't think it is required to increase max_map_count in all scenarios.

ES 2.x -
In the default (hybrid nio + mmap) FS mode only the .dvd and .tim files (maybe point files too) are mmapped, and that would allow for ~30000 shards per node.

ES 5.x - there is segment throttling, so although the default moves to mmapfs, the default of 64k may still work fine.

Raising the limit could be useful if you plan to use mmapfs and have > 1000 shards per node. (I personally see many other issues creep in with a high number of shards per node.)

mmapfs store - this limit only comes into play when the store is mmapfs and each node stores > 65000 segment files (or 1000+ shards). I would rather add more nodes than have such a massive number of shards per node on mmapfs.

You can only see this as an approximation. One shard does not map to one (or two) virtual memory areas. As described by Uwe, each shard has a number of segments, and each segment contains at least one memory-mapped file. Also, while merging, you may have extra segments that come and go.

Furthermore, and this is more subtle, vm.max_map_count is a limit on virtual memory areas (VMAs) of the Linux kernel, which are the result of mmap()/malloc() calls. The reality is that a memory allocation may use one or more VMAs, because of size (Lucene portions of 1 GiB) or permissions. So, large Lucene files take more than one VMA. You can check this with cat /proc/<ES Process ID>/maps | grep Lucene (e.g. on my system I can see 594 VMAs for 290 shards). Besides this, the Linux kernel can also merge VMAs on contiguous memory areas, effectively reducing the count of VMAs. Example: transparent huge pages (THP), which is enabled by default in RHEL/CentOS 7.
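The same check can be scripted. A minimal sketch, assuming Linux, where each line of /proc/<pid>/maps describes one VMA; the function names are hypothetical and the sample text is made up in the maps format purely for illustration, not taken from a real node.

```python
def count_vmas(maps_text, needle=""):
    """Count non-empty lines (one per VMA) in maps output, optionally filtered by a substring."""
    return sum(1 for line in maps_text.splitlines() if line.strip() and needle in line)


def read_maps(pid):
    # /proc/<pid>/maps is Linux-specific; reading another user's process may need privileges.
    with open(f"/proc/{pid}/maps") as f:
        return f.read()


# Made-up sample: one large file split across two VMAs, plus one anonymous mapping.
sample = """\
7f2c00000000-7f2c40000000 r--s 00000000 08:01 131 /data/index/_0.dvd
7f2c40000000-7f2c80000000 r--s 40000000 08:01 131 /data/index/_0.dvd
7f2c80000000-7f2c80021000 rw-p 00000000 00:00 0
"""

print(count_vmas(sample))          # 3
print(count_vmas(sample, ".dvd"))  # 2
```

For a running node, `count_vmas(read_maps(pid))` gives the total that counts against vm.max_map_count, while the filtered count shows how much of it comes from index files.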

The 65530 limit of max_map_count is typically exceeded by the classic "Logstash use case": users see plenty of RAM left free while hitting OOM when writing log messages into a vast number of different time-windowed indices on a single node with large RAM, like 64GB+.

An increased value for max_map_count makes life easier for ES support, because it lets users run Logstash-like use cases on machines with 64GB RAM or more. The impact is only a bit more memory usage by the kernel for managing more VMAs, but performance is not affected.

For my use cases, I find the default value 65530 is more than sufficient.

Segment throttling has only a very indirect relation to vm.max_map_count. Maybe you mean the number of created segment files which are mmap()ed per time unit. The segment creations are automatically adjusted by both ES 2.x and ES 5.x afaik.

Thanks, this is very helpful to know! I was trying to gauge the use cases and impact. So maybe an approximate rule I will go with is if RAM > 4GB, max_map_count can be increased.