Elasticsearch MMAPFS -> Swap

icm · July 4, 2020, 8:10am

Hi,

At our company we have a 6.1 cluster. To prepare for the upgrade to 6.8 and later, we've started rolling out new machines on Focal Fossa / Ubuntu 20.04 instead of Ubuntu 18.04.

The cluster consists of 4 data nodes, 3 search nodes, 3 masters. The 4 data nodes are named data-01 / data-02 / data-03 / data-04 and split into two pools via node attributes: odd: pool1 | even: pool2.

The JDK version is the same, latest OpenJDK from same PPA.

The "maybe memleak" triggers on nodes 3 / 4 (pool 1 & 2), both on Focal Fossa (20.04), but not on nodes 1 / 2 (pool 1 & 2), which run Bionic Beaver (18.04).

Maybe memleak means: the nodes start using swap space, not explosively, but at a slow and steady pace (~2 GB in 12 hours).

All nodes use MMAPFS. As far as I understand it, using the MMAPFS store means...

64 GB of RAM per Node:

The heap (~28 GB) is pre-touched (-XX:+AlwaysPreTouch), used by Elasticsearch's Java process, and mlocked, so it cannot be swapped
Lucene's index files on disk are mmapped, hence they live outside the mlocked heap
Via mmap, Elasticsearch/Lucene can access the files and read the necessary index data into RAM
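
One way to check the first assumption (that the process memory is really locked) is to look at the VmLck and VmSwap fields of /proc/<pid>/status. This is only a minimal sketch: the function name lock_and_swap is made up for illustration, and it takes the path of the status file as its argument.

```shell
# Print the locked (VmLck) and swapped (VmSwap) memory of a process.
# Usage: lock_and_swap /proc/<pid>/status
# (lock_and_swap is a hypothetical helper name, not an existing tool.)
lock_and_swap() {
    awk '$1 == "VmLck:" || $1 == "VmSwap:" { print $1, $2, $3 }' "$1"
}
```

A VmLck of 0 kB would suggest that memory locking is not in effect for that process.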

Since swap was involved, I utilized smem and started analyzing which segments of memory were used in swap. The 2 GB of swap space belong to elasticsearch and MMAP.

This led to a discussion in our team whether or not this is correct behaviour.

As far as I understand it:
Lucene "indirectly" just gobbles up as much RAM as it can via MMAP and since we currently (due to the migration) are not writing new data, the indexes are not modified - hence they won't be closed and the RAM won't get freed.

It's just a sad coincidence it happens on the Focal Fossa machines.

Is this statement correct? If no RAM is available, Lucene starts utilizing the swap?

(vm.swappiness is 1 and vm.overcommit_memory is 0, since I think that limiting the RAM would lead to exceptions and a dead Elasticsearch process.)

The only alternative would be to switch to NIOFS, since 6.1 doesn't have hybridfs (as in 6.8, backported from 7).

NIOFS could be a drop-in replacement, since setting index.store.type to niofs doesn't change the files on disk, only the way the files are accessed, correct?

Thanks in advance if someone could clear that up.

DavidTurner · July 4, 2020, 10:46am

This doesn't sound right. Mmapped files shouldn't be consuming any swap. They're already on disk so their pages can simply be dropped if the space is needed for other things.

Could you share more details of your analysis here?

Also note that you're very strongly encouraged to disable swap. It causes a lot more problems than it solves.

icm · July 4, 2020, 11:30am

Hi,

Yes, of course. Currently the node is at ~1.1 GB of swap.

https://gist.github.com/Nezisi/266b528e6c9da9585a652787bf96a4ad

smem.txt

   PID User             Command                                                                                                                                                        Swap     USS     PSS     RSS 
  1918 elasticsearch    /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller                                                                                 640.0K    4.0K  135.0K    2.0M 
   925 root             /sbin/agetty -o -p -- \u --noclear tty1 linux                                                                                                                     0  308.0K  401.0K    2.0M 
   867 root             /usr/sbin/cron -f                                                                                                                                                 0  308.0K  549.0K    2.9M 
   686 _rpc             /sbin/rpcbind -f -w                                                                                                                                               0  484.0K  863.0K    3.6M 
   878 nagios           /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f -4                                                                                                                      0  672.0K    1.2M    4.4M 
   883 root             /usr/sbin/qemu-ga                                                                                                                                                 0    1.3M    1.4M    3.5M 
   697 systemd-timesync /lib/systemd/systemd-timesyncd                                                                                                                                    0  944.0K    1.5M    6.0M 
   869 messagebus       /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only                                                          0    1.4M    1.6M    4.2M 
   431 root             /lib/systemd/systemd-udevd                                                                                                                                        0    1.6M    1.8M    4.5M

smem_java.txt

Map                                       PIDs   AVGPSS      PSS 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0 
/data/es1/nodes/0/indices/-9Ud4hs-RqCdlf     1        0        0

vm_sysctl.txt

vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 268435456
vm.dirty_background_ratio = 0
vm.dirty_bytes = 1073741824
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200


In this gist, I've pasted:

the smem overview
the smem mappings for the Java process
vmstat
the vm.* sysctl settings

We have 4 GB of swap only for emergencies like this, hence vm.swappiness = 1.

Thank you very much for your help.

DavidTurner · July 4, 2020, 12:17pm

I don't see anything here to indicate that any mmapped files are being swapped out. Can you share the full contents of /proc/874/smaps (where 874 is Elasticsearch's PID)?

icm · July 4, 2020, 12:35pm

Sure... Then I must have misunderstood the smem reports.

That at least clears up the swap part.

https://gist.githubusercontent.com/Nezisi/41bb486db3561d15b142d1859a8a06ff/raw/8699856ed1d947046a7f6f6eb65dd237856c1648/smaps.txt

Warning: It's 13 MB large.

Thanks!

DavidTurner · July 4, 2020, 12:53pm

Indeed, no mmapped files are consuming any swap here:

$ cat smaps.txt | grep -e '/data/es1/' -A22 | grep Swap: | uniq -c
6469 Swap:                  0 kB
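
The -A22 in the pipeline above depends on where the Swap: line sits inside each smaps entry, which can differ between kernel versions. A variant that tracks the mapping header instead could look like the sketch below. The function name swap_by_backing is hypothetical, and it only looks at the first word of the path, so paths containing spaces are not handled.

```shell
# Split the Swap: totals of an smaps dump into mappings whose path starts
# with PREFIX and everything else (anonymous memory, libraries, ...).
# Usage: swap_by_backing FILE PREFIX
# (swap_by_backing is a hypothetical helper name.)
swap_by_backing() {
    awk -v prefix="$2" '
        # A mapping header starts with an address range such as 7c01e0000-7c0a81000.
        $1 ~ /^[0-9a-f]+-[0-9a-f]+$/ { in_prefix = (index($6, prefix) == 1) }
        # "Swap:" is matched exactly, so "SwapPss:" lines are not counted.
        $1 == "Swap:" { if (in_prefix) file_kb += $2; else other_kb += $2 }
        END { printf "under_prefix_kB=%d other_kB=%d\n", file_kb, other_kb }
    ' "$1"
}
```

For the dump discussed here it would be called as swap_by_backing smaps.txt /data/es1/.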

icm · July 4, 2020, 1:29pm

Yeah... You're right. Based on your grep, I used a regex to extract every mapping with more than 9 kB of swap:

grep -B18 -A4 -Pi 'Swap:\s+([\d]{2,})' smaps.txt

Output is here:

https://gist.github.com/Nezisi/f9177f8eb01847b1bebfcbf28c047a4c

smaps_swap.txt

7c01e0000-7c0a81000 rw-p 00000000 00:00 0 
Size:               8836 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                8684 kB
Pss:                8684 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      8684 kB

A calculation shows that the regex isn't completely bogus:

$ echo $((`grep 'Swap: ' smaps_swap.txt | awk '{ print $2; }' | tr '\n' '+'`0))
1335656
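
The same total (in kB) can also be computed with awk alone. A minimal sketch, with sum_swap_kb as a made-up function name:

```shell
# Sum every "Swap:" field (in kB) of an smaps dump.
# Usage: sum_swap_kb FILE
# (sum_swap_kb is a hypothetical helper name.)
sum_swap_kb() {
    # Exact match on "Swap:" skips the separate "SwapPss:" lines.
    awk '$1 == "Swap:" { total += $2 } END { print total + 0 }' "$1"
}
```

It would be called as sum_swap_kb smaps_swap.txt.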

Is there any way to find out why / what these addresses refer to?

DavidTurner · July 4, 2020, 1:58pm

Not really, no, they're just various bits of dynamically-allocated memory that isn't being used much so the OS has decided to swap them out.
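
What can at least be listed is which address ranges hold the swapped pages, and whether each one has a path or a kernel label such as [heap] attached. The following is only a sketch of that idea: swapped_mappings is a hypothetical name, the default threshold of 10 kB mirrors the regex above, and mappings without any name are printed as [anon].

```shell
# Print "swap_kB address-range name" for each mapping with at least MIN_KB
# swapped out, largest first. Unnamed mappings are shown as [anon].
# Usage: swapped_mappings FILE [MIN_KB]
# (swapped_mappings is a hypothetical helper name.)
swapped_mappings() {
    awk -v min="${2:-10}" '
        # Remember the address range and name of the current mapping.
        $1 ~ /^[0-9a-f]+-[0-9a-f]+$/ { range = $1; name = (NF >= 6 ? $6 : "[anon]") }
        $1 == "Swap:" && $2 + 0 >= min + 0 { print $2, range, name }
    ' "$1" | sort -rn
}
```

For example, swapped_mappings smaps.txt 1024 would list only mappings with at least 1 MB in swap.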

system · August 1, 2020, 1:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
