Elasticsearch JVM memory spikes above configured limits, causing oom-kill

Dear Elasticsearch connoisseurs,

We have a recurring issue in our clusters: nodes suddenly exit because the java process is oom-killed.
Let's take the example of this failing node:

  • 94.3 GB of RAM
  • 8 CPUs
  • Swap disabled

Our jvm.options config file is the default one, apart from these custom settings:
-Xms40g
-Xmx40g

During normal operation, the node's used memory stays at about 40 GB, and the only significant RAM-consuming process is java: /usr/share/elasticsearch/jdk/bin/java.

But after 30 minutes to a few hours, the used RAM begins to increase very quickly and reaches 90 GB in about 30 s. The java process is then oom-killed.
During this increase, htop doesn't show any growth in the RAM used by the java process or by any other process; java stays at 40 GB. Yet the overall used RAM is more than 90 GB...
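
To put a number on this gap between "used RAM" and what the processes report, here is a minimal sketch, assuming a Linux /proc filesystem. It compares MemTotal - MemAvailable from /proc/meminfo with the sum of VmRSS over all processes. The function names are hypothetical, and the result is only a rough estimate: summing RSS double-counts shared pages, and page cache and kernel memory are not tied to any process.

```python
import glob
import os
import re


def meminfo_values(text):
    """Parse /proc/meminfo-style text into {field: integer value (kB for most fields)}."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"([^:]+):\s+(\d+)", line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values


def rss_kb(status_text):
    """Return VmRSS in kB from /proc/<pid>/status text, or 0 if absent (kernel threads)."""
    m = re.search(r"^VmRSS:\s+(\d+)", status_text, re.M)
    return int(m.group(1)) if m else 0


def unaccounted_kb(meminfo_text, status_texts):
    """Used memory (MemTotal - MemAvailable) minus the summed per-process RSS."""
    info = meminfo_values(meminfo_text)
    used = info["MemTotal"] - info["MemAvailable"]
    return used - sum(rss_kb(s) for s in status_texts)


def main():
    if not os.path.exists("/proc/meminfo"):
        print("no /proc/meminfo here; this sketch assumes Linux")
        return
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    status_texts = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                status_texts.append(f.read())
        except OSError:
            pass  # the process exited while we were scanning
    gap = unaccounted_kb(meminfo_text, status_texts)
    print(f"used memory not attributed to any process RSS: {gap / 1024 / 1024:.1f} GiB")


if __name__ == "__main__":
    main()
```

Running this in a loop while the RAM climbs would show whether the growth is inside a process or outside all of them.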

I would be so grateful for a little help.

Here is the last dmesg output from an oom-kill on this node:

[186243.224511] elasticsearch[B invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[186243.224515] CPU: 2 PID: 58795 Comm: elasticsearch[B Tainted: G             L    5.10.0-16-amd64 #1 Debian 5.10.127-2
[186243.224516] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[186243.224521] Call Trace:
[186243.224530]  dump_stack+0x6b/0x83
[186243.224533]  dump_header+0x4a/0x1f0
[186243.224535]  oom_kill_process.cold+0xb/0x10
[186243.224538]  out_of_memory+0x1bd/0x4e0
[186243.224541]  __alloc_pages_slowpath.constprop.0+0xb8c/0xc60
[186243.224543]  __alloc_pages_nodemask+0x2da/0x310
[186243.224545]  pagecache_get_page+0x16d/0x380
[186243.224546]  filemap_fault+0x69e/0x900
[186243.224569]  ext4_filemap_fault+0x2d/0x40 [ext4]
[186243.224571]  __do_fault+0x37/0x170
[186243.224573]  handle_mm_fault+0x11e7/0x1bf0
[186243.224715]  do_user_addr_fault+0x1b8/0x3f0
[186243.224751]  ? switch_fpu_return+0x40/0xb0
[186243.224755]  exc_page_fault+0x78/0x160
[186243.224757]  ? asm_exc_page_fault+0x8/0x30
[186243.224758]  asm_exc_page_fault+0x1e/0x30
[186243.224788] RIP: 0033:0x7fa769eb7ad8
[186243.224792] Code: Unable to access opcode bytes at RIP 0x7fa769eb7aae.
[186243.224793] RSP: 002b:00007f9c99ab5610 EFLAGS: 00010246
[186243.224794] RAX: ffffffffffffff92 RBX: 00007f9c99ab5670 RCX: 00007fa769eb7ad8
[186243.224795] RDX: 0000000000000000 RSI: 0000000000000089 RDI: 00007fa7666160e8
[186243.224796] RBP: 00007fa7666160c0 R08: 0000000000000000 R09: 00000000ffffffff
[186243.224796] R10: 00007f9c99ab5700 R11: 0000000000000246 R12: 0000000000000000
[186243.224797] R13: 00007fa766616098 R14: 00007f9c99ab5700 R15: 00007fa7666160e8
[186243.224799] Mem-Info:
[186243.224802] active_anon:1166 inactive_anon:11029737 isolated_anon:0
                 active_file:138 inactive_file:82 isolated_file:0
                 unevictable:0 dirty:5 writeback:0
                 slab_reclaimable:5542 slab_unreclaimable:10545
                 mapped:2209 shmem:4250 pagetables:22259 bounce:0
                 free:114789 free_pcp:67 free_cma:0
[186243.224804] Node 0 active_anon:4664kB inactive_anon:44118948kB active_file:552kB inactive_file:328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:8836kB dirty:20kB writeback:0kB shmem:17000kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 43511808kB writeback_tmp:0kB kernel_stack:4640kB all_unreclaimable? no
[186243.224805] Node 0 DMA free:15908kB min:8kB low:20kB high:32kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[186243.224808] lowmem_reserve[]: 0 2964 96528 96528 96528
[186243.224810] Node 0 DMA32 free:375972kB min:2072kB low:5108kB high:8144kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:3063680kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[186243.224812] lowmem_reserve[]: 0 0 93563 93563 93563
[186243.224814] Node 0 Normal free:67276kB min:67544kB low:163352kB high:259160kB reserved_highatomic:0KB active_anon:4664kB inactive_anon:44118948kB active_file:644kB inactive_file:420kB unevictable:0kB writepending:20kB present:97517568kB managed:95814116kB mlocked:0kB pagetables:89036kB bounce:0kB free_pcp:268kB local_pcp:268kB free_cma:0kB
[186243.224816] lowmem_reserve[]: 0 0 0 0 0
[186243.224818] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[186243.224825] Node 0 DMA32: 3*4kB (UM) 4*8kB (UM) 3*16kB (U) 4*32kB (UM) 9*64kB (UM) 9*128kB (UM) 4*256kB (M) 1*512kB (U) 2*1024kB (UM) 3*2048kB (M) 89*4096kB (ME) = 376220kB
[186243.224859] Node 0 Normal: 1726*4kB (UME) 441*8kB (UME) 267*16kB (UME) 162*32kB (UME) 416*64kB (UME) 167*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 67888kB
[186243.224890] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[186243.224891] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[186243.224891] 4440 total pagecache pages
[186243.224892] 0 pages in swap cache
[186243.224893] Swap cache stats: add 0, delete 0, find 0/0
[186243.224894] Free swap  = 0kB
[186243.224894] Total swap = 0kB
[186243.224895] 25165694 pages RAM
[186243.224895] 0 pages HighMem/MovableOnly
[186243.224896] 442268 pages reserved
[186243.224899] 0 pages hwpoisoned
[186243.224900] Tasks state (memory values in pages):
[186243.224900] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[186243.224914] [    277]     0   277     8499     3119   102400        0          -250 systemd-journal
[186243.224916] [    298]     0   298     5429      303    61440        0         -1000 systemd-udevd
[186243.224917] [    362]     0   362    11936      364    86016        0             0 VGAuthService
[186243.224918] [    363]     0   363    59059      309    81920        0             0 vmtoolsd
[186243.224920] [    366]     0   366     1686       66    53248        0             0 cron
[186243.224921] [    367]   104   367     1995      161    57344        0          -900 dbus-daemon
[186243.224923] [    369]     0   369    55199      907    81920        0             0 rsyslogd
[186243.224924] [    370]     0   370     3407      214    73728        0             0 systemd-logind
[186243.224927] [    559]   110   559     2102      187    57344        0          -500 nrpe
[186243.224929] [    572]     0   572    27179     2097   106496        0             0 unattended-upgr
[186243.224930] [    579]     0   579     1461       28    49152        0             0 agetty
[186243.224931] [    583]   106   583    18624      176    61440        0             0 ntpd
[186243.224933] [    604]     0   604     3338      244    65536        0         -1000 sshd
[186243.224934] [    839]   107   839     4592      246    73728        0             0 exim4
[186243.224936] [  12512]     0 12512    72389     1177   143360        0             0 packagekitd
[186243.224938] [  12516]     0 12516    58375      232    86016        0             0 polkitd
[186243.224941] [  56865]     0 56865     3570      308    69632        0             0 sshd
[186243.224942] [  56868]     0 56868     3797      319    65536        0             0 systemd
[186243.224943] [  56869]     0 56869    42411      955    98304        0             0 (sd-pam)
[186243.224945] [  56889]     0 56889     2232      174    49152        0             0 zsh
[186243.224947] [  58609]   109 58609 12460752 11013923 89460736        0             0 java
[186243.224949] [  58788]   109 58788    27057      149    90112        0             0 controller
[186243.224951] [  58899]     0 58899     3569      303    61440        0             0 sshd
[186243.224952] [  58906]     0 58906     2213      174    61440        0             0 zsh
[186243.224953] [  58909]     0 58909     2534       80    53248        0             0 systemctl
[186243.224955] [  58910]     0 58910     1427       26    45056        0             0 pager
[186243.224956] [  58914]     0 58914     2367      467    61440        0             0 htop
[186243.224957] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/elasticsearch.service,task=java,pid=58609,uid=109
[186243.225054] Out of memory: Killed process 58609 (java) total-vm:49843008kB, anon-rss:44055692kB, file-rss:0kB, shmem-rss:0kB, UID:109 pgtables:87364kB oom_score_adj:0
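
The log above reports most values in pages, which are hard to read at a glance. The following is a minimal sketch that converts them to GiB and adds up the counters from the Mem-Info section, assuming 4 KiB pages (consistent with active_anon:1166 being printed as 4664kB). The dictionary keys and function names are only illustrative, and since the dump does not list every kind of kernel allocation, the "unattributed" figure is an estimate, not an exact accounting.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, as the kB figures in the log suggest


def pages_to_gib(pages, page_kib=PAGE_KIB):
    """Convert a page count to GiB."""
    return pages * page_kib / (1024 * 1024)


# Page counts copied from the Mem-Info section of the log above.
# File LRU pages are used instead of "total pagecache pages" so that
# shmem pages (already on the anon LRU lists) are not counted twice.
MEM_INFO_PAGES = {
    "active_anon": 1166,
    "inactive_anon": 11029737,
    "active_file": 138,
    "inactive_file": 82,
    "slab_reclaimable": 5542,
    "slab_unreclaimable": 10545,
    "pagetables": 22259,
    "free": 114789,
    "reserved": 442268,
}
TOTAL_RAM_PAGES = 25165694  # "pages RAM"
JAVA_RSS_PAGES = 11013923   # rss column of the java process (pid 58609)


def unattributed_pages(total_pages, counters):
    """Pages of RAM not covered by any of the listed counters."""
    return total_pages - sum(counters.values())


if __name__ == "__main__":
    print(f"total RAM:      {pages_to_gib(TOTAL_RAM_PAGES):.1f} GiB")
    print(f"java RSS:       {pages_to_gib(JAVA_RSS_PAGES):.1f} GiB")
    print(f"free:           {pages_to_gib(MEM_INFO_PAGES['free']):.2f} GiB")
    gap = unattributed_pages(TOTAL_RAM_PAGES, MEM_INFO_PAGES)
    print(f"unattributed:   {pages_to_gib(gap):.1f} GiB")
```

Under these assumptions, the java process accounts for about 42 GiB of RSS at the time of the kill, while a large share of the 96 GiB of RAM is not attributed to any counter shown in the dump, which matches what htop showed.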
