Elasticsearch MMAPFS -> Swap


At our company we run a 6.1 cluster. To prepare for the upgrade to 6.8 and onwards, we've started rolling out new machines on Ubuntu 20.04 (Focal Fossa) instead of Ubuntu 18.04.

The cluster consists of 4 data nodes, 3 search nodes and 3 masters. The 4 data nodes are named data-01 / data-02 / data-03 / data-04 and split into two pools via node attributes: odd-numbered nodes in pool1, even-numbered nodes in pool2.

The JDK version is the same on all nodes: the latest OpenJDK from the same PPA.

The "maybe memleak" triggers on nodes 3 and 4 (pool1 and pool2), both running Focal Fossa (20.04) - but not on nodes 1 and 2 (pool1 and pool2) running Bionic Beaver (18.04).

"Maybe memleak" means: the nodes start utilizing swap space, not in an explosive way, but at a slow and steady pace (~2 GB in 12 hours).

All nodes use MMAPFS. As I understand it, utilizing the MMAPFS store means...

64 GB of RAM per Node:

  • You have the heap, pre-touched (-XX:+AlwaysPreTouch) and used by Elasticsearch's Java process (~28 GB); it's mlocked and cannot be swapped
  • You have Lucene's index files on disk memory-mapped, so they live outside the mlocked heap
  • Via mmap, Elasticsearch/Lucene can access the files and read the necessary index data into RAM
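For reference, a sketch of the settings that produce the memory layout described above on a 6.x node (these excerpts are illustrative of the setup described in this thread, not our exact config files):

```
# elasticsearch.yml
bootstrap.memory_lock: true   # mlock the heap so it can never be swapped
index.store.type: mmapfs      # Lucene segment files are memory-mapped

# jvm.options
-Xms28g                       # ~28 GB heap, min == max
-Xmx28g
-XX:+AlwaysPreTouch           # fault in every heap page at startup
```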

Since swap was involved, I used smem and started analyzing which memory segments had been swapped out. The 2 GB of swap space belong to Elasticsearch and mmap.

This led to a discussion in our team whether or not this is correct behaviour.

As far as I understand it:
Lucene "indirectly" gobbles up as much RAM as it can via mmap, and since we are currently not writing new data (due to the migration), the indexes are not modified - hence they won't be closed and the RAM won't get freed.

It's just a sad coincidence it happens on the Focal Fossa machines.

Is this statement correct? If no RAM is available, does Lucene start utilizing swap?

(vm.swappiness is 1 and vm.overcommit_memory is 0, since I think that limiting the RAM would lead to exceptions and a dead Elasticsearch process.)
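These knobs can be inspected without root via procfs; the comments show the values from this setup:

```shell
cat /proc/sys/vm/swappiness         # 1 on these nodes (avoid swap unless necessary)
cat /proc/sys/vm/overcommit_memory  # 0 = heuristic overcommit, the kernel default
grep -E '^Swap(Total|Free)' /proc/meminfo   # overall swap usage on the node
```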

The only alternative would be to switch to NIOFS, since 6.1 doesn't have the hybrid store (as in 6.8, backported from 7).

NIOFS should be a drop-in replacement: changing index.store to niofs doesn't change the files on disk, only the way they are accessed - correct?
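If we go down that route, the switch would look roughly like this (index name is hypothetical; in 6.x index.store.type is a static index setting, so the index has to be closed before it can be changed):

```
POST /my-index/_close

PUT /my-index/_settings
{ "index.store.type": "niofs" }

POST /my-index/_open
```

Alternatively, setting index.store.type: niofs in elasticsearch.yml applies the store type node-wide.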

Thanks if someone could clear that up :slight_smile:

This doesn't sound right. Mmapped files shouldn't be consuming any swap. They're already on disk so their pages can simply be dropped if the space is needed for other things.

Could you share more details of your analysis here?

Also note that you're very strongly encouraged to disable swap. It causes a lot more problems than it solves.



Yes, of course. Currently the node is at ~1.1 GB of swap.

In this gist, I've pasted the

  • smem overview
  • smem mappings for java process
  • vmstat
  • vmctl

We have 4 GB of swap, only for emergency cases like this - hence vm.swappiness = 1.

Thank you very much for your help.

I don't see anything here to indicate that any mmapped files are being swapped out. Can you share the full contents of /proc/874/smaps (where 874 is Elasticsearch's PID)?


Sure... Then I must have misunderstood the smem reports.

That cleans up at least the SWAP part :wink:


Warning: It's 13 MB large.


Indeed, no mmapped files are consuming any swap here:

$ grep -A22 -e '/data/es1/' smaps.txt | grep 'Swap:' | uniq -c
6469 Swap:                  0 kB

Yeah... you're right. Based on your grep, I used a regex to extract every mapping with swap usage of 10 kB or more:

grep -B18 -A4 -Pi 'Swap:\s+([\d]{2,})' smaps.txt

Output is here:

A calculation shows that the regex isn't completely bogus:

echo $((`grep 'Swap: ' smaps_swap.txt | awk '{ print $2; }' | tr '\n' '+'`0))
1335656
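The same sum can be done in a single awk pass (smaps_swap.txt is the extract produced by the grep above; the result is in kB):

```shell
awk '/^Swap:/ { sum += $2 } END { print sum }' smaps_swap.txt
```

Anchoring the pattern at the start of the line keeps SwapPss: lines, if present, out of the total.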

Is there any way to find out why / what these addresses refer to?

Not really, no - they're just various bits of dynamically allocated memory that aren't being used much, so the OS has decided to swap them out.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.