Elasticsearch 8: new OOM kills in comparison with ES7

We are testing an Elasticsearch 8.7.0 cluster setup (in order to migrate from ES 7.17.9),
and we have run into a problem: the voting-only node in the test cluster
is sometimes killed by the OOM killer.

The setup is almost identical to our current production ES7 cluster,
where we have never encountered an OOM kill, which makes this quite odd.

ES7 setup (never encounters OOM):

  • Version: Elasticsearch 7.17.9
  • OS: Ubuntu 22.04.1 LTS, kernel: 5.15.0-1030-aws
  • Hardware (EC2):
    2x r6g.large (arm64, 2 vCPU, 16 GiB RAM) (data nodes)
    1x t4g.small (arm64, 2 vCPU, 2 GiB RAM) (voting-only node)
  • amazon-cloudwatch-agent: version 1.247357.0b252275

ES8 setup (sometimes encounters OOM on the voting-only node):

  • Version: Elasticsearch 8.7.0
  • OS: Ubuntu 22.04.2 LTS, kernel: 5.15.0-1031-aws
  • Hardware (EC2):
    2x r6g.large (arm64, 2 vCPU, 16 GiB RAM) (data nodes)
    1x t4g.small (arm64, 2 vCPU, 2 GiB RAM) (voting-only node)
  • amazon-cloudwatch-agent: version 1.247358.0b252413

The only other application running on the nodes is amazon-cloudwatch-agent, which
collects logs and metrics and has the same memory footprint in both setups.

Also, the memory footprint of both Elasticsearch instances on voting-only nodes
is similar, but not entirely the same (ES8 node spawns two processes):

ES7 production setup memory footprint:
elastic+ process 78.6% RAM
ES8 testing setup memory footprint:
elastic+ process 79.9% RAM
elastic+ process 4.7% RAM

We could increase the RAM of the voting-only node to 4 GB, but first we would like to
understand what is going on.

Do you think Elasticsearch 8 or its bundled Java could be causing these new
OOM-killer invocations?

Can you share the kernel logs from an example OOM-killer invocation? It'll be 150-ish lines in the dmesg output that starts with a line containing invoked oom-killer and ends with one containing Killed process.

Also the output of GET _nodes/jvm (when the voting-only node is running normally)
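
The dmesg extraction described above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function name extract_oom_reports is made up for this example, and the demo lines are abbreviated from the kind of report requested.

```python
from typing import Iterable, List

START = "invoked oom-killer"   # marker on the first line of a report
END = "Killed process"         # marker on the last line of a report


def extract_oom_reports(lines: Iterable[str]) -> List[List[str]]:
    """Return one list of lines per complete OOM-killer report."""
    reports: List[List[str]] = []
    current = None
    for line in lines:
        if START in line:
            current = [line]              # a new report begins here
        elif current is not None:
            current.append(line)
            if END in line:               # report is complete
                reports.append(current)
                current = None
    return reports


# Abbreviated demo input; real input would be the kernel log.
demo = [
    "kernel: unrelated line",
    "kernel: amazon-cloudwat invoked oom-killer: gfp_mask=...",
    "kernel: Mem-Info:",
    "kernel: Out of memory: Killed process 3414 (java)",
    "kernel: another unrelated line",
]
for report in extract_oom_reports(demo):
    print("\n".join(report))
```

In practice the lines would come from the kernel log, for example the output of dmesg or journalctl -k saved to a file and read line by line.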

Hi David, sure, here:

{
	"_nodes": {
		"total": 3,
		"successful": 3,
		"failed": 0
	},
	"cluster_name": "cluster-company-es8",
	"nodes": {
		"CLw8HuhCTZ20dS9uii_Vsg": {
			"name": "x-elastic1.company.internal",
			"transport_address": "10.0.42.89:9300",
			"host": "10.0.42.89",
			"ip": "10.0.42.89",
			"version": "8.7.0",
			"build_flavor": "default",
			"build_type": "deb",
			"build_hash": "09520b59b6bc1057340b55750186466ea715e30e",
			"roles": [
				"data",
				"master"
			],
			"attributes": {
				"aws_availability_zone": "eu-central-1a",
				"xpack.installed": "true"
			},
			"jvm": {
				"pid": 789,
				"version": "19.0.2",
				"vm_name": "OpenJDK 64-Bit Server VM",
				"vm_version": "19.0.2+7-44",
				"vm_vendor": "Oracle Corporation",
				"using_bundled_jdk": true,
				"start_time_in_millis": 1681466101902,
				"mem": {
					"heap_init_in_bytes": 8271167488,
					"heap_max_in_bytes": 8271167488,
					"non_heap_init_in_bytes": 7667712,
					"non_heap_max_in_bytes": 0,
					"direct_max_in_bytes": 0
				},
				"gc_collectors": [
					"G1 Young Generation",
					"G1 Old Generation"
				],
				"memory_pools": [
					"CodeHeap 'non-nmethods'",
					"Metaspace",
					"CodeHeap 'profiled nmethods'",
					"Compressed Class Space",
					"G1 Eden Space",
					"G1 Old Gen",
					"G1 Survivor Space",
					"CodeHeap 'non-profiled nmethods'"
				],
				"using_compressed_ordinary_object_pointers": "true",
				"input_arguments": [
					"-Des.networkaddress.cache.ttl=60",
					"-Des.networkaddress.cache.negative.ttl=10",
					"-Djava.security.manager=allow",
					"-XX:+AlwaysPreTouch",
					"-Xss1m",
					"-Djava.awt.headless=true",
					"-Dfile.encoding=UTF-8",
					"-Djna.nosys=true",
					"-XX:-OmitStackTraceInFastThrow",
					"-Dio.netty.noUnsafe=true",
					"-Dio.netty.noKeySetOptimization=true",
					"-Dio.netty.recycler.maxCapacityPerThread=0",
					"-Dlog4j.shutdownHookEnabled=false",
					"-Dlog4j2.disable.jmx=true",
					"-Dlog4j2.formatMsgNoLookups=true",
					"-Djava.locale.providers=SPI,COMPAT",
					"--add-opens=java.base/java.io=ALL-UNNAMED",
					"-XX:+UseG1GC",
					"-Djava.io.tmpdir=/tmp/elasticsearch-5657180311228775275",
					"-XX:+HeapDumpOnOutOfMemoryError",
					"-XX:+ExitOnOutOfMemoryError",
					"-XX:HeapDumpPath=/var/lib/elasticsearch",
					"-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
					"-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m",
					"-Xms7888m",
					"-Xmx7888m",
					"-XX:MaxDirectMemorySize=4135583744",
					"-XX:G1HeapRegionSize=4m",
					"-XX:InitiatingHeapOccupancyPercent=30",
					"-XX:G1ReservePercent=15",
					"-Des.distribution.type=deb",
					"--module-path=/usr/share/elasticsearch/lib",
					"--add-modules=jdk.net",
					"-Djdk.module.main=org.elasticsearch.server"
				]
			}
		},
		"1YX-VUN3S7-6mpioiN1zQQ": {
			"name": "x-elastic2.company.internal",
			"transport_address": "10.0.87.233:9300",
			"host": "10.0.87.233",
			"ip": "10.0.87.233",
			"version": "8.7.0",
			"build_flavor": "default",
			"build_type": "deb",
			"build_hash": "09520b59b6bc1057340b55750186466ea715e30e",
			"roles": [
				"data",
				"master"
			],
			"attributes": {
				"aws_availability_zone": "eu-central-1b",
				"xpack.installed": "true"
			},
			"jvm": {
				"pid": 790,
				"version": "19.0.2",
				"vm_name": "OpenJDK 64-Bit Server VM",
				"vm_version": "19.0.2+7-44",
				"vm_vendor": "Oracle Corporation",
				"using_bundled_jdk": true,
				"start_time_in_millis": 1681466101113,
				"mem": {
					"heap_init_in_bytes": 8271167488,
					"heap_max_in_bytes": 8271167488,
					"non_heap_init_in_bytes": 7667712,
					"non_heap_max_in_bytes": 0,
					"direct_max_in_bytes": 0
				},
				"gc_collectors": [
					"G1 Young Generation",
					"G1 Old Generation"
				],
				"memory_pools": [
					"CodeHeap 'non-nmethods'",
					"Metaspace",
					"CodeHeap 'profiled nmethods'",
					"Compressed Class Space",
					"G1 Eden Space",
					"G1 Old Gen",
					"G1 Survivor Space",
					"CodeHeap 'non-profiled nmethods'"
				],
				"using_compressed_ordinary_object_pointers": "true",
				"input_arguments": [
					"-Des.networkaddress.cache.ttl=60",
					"-Des.networkaddress.cache.negative.ttl=10",
					"-Djava.security.manager=allow",
					"-XX:+AlwaysPreTouch",
					"-Xss1m",
					"-Djava.awt.headless=true",
					"-Dfile.encoding=UTF-8",
					"-Djna.nosys=true",
					"-XX:-OmitStackTraceInFastThrow",
					"-Dio.netty.noUnsafe=true",
					"-Dio.netty.noKeySetOptimization=true",
					"-Dio.netty.recycler.maxCapacityPerThread=0",
					"-Dlog4j.shutdownHookEnabled=false",
					"-Dlog4j2.disable.jmx=true",
					"-Dlog4j2.formatMsgNoLookups=true",
					"-Djava.locale.providers=SPI,COMPAT",
					"--add-opens=java.base/java.io=ALL-UNNAMED",
					"-XX:+UseG1GC",
					"-Djava.io.tmpdir=/tmp/elasticsearch-16872787292984437309",
					"-XX:+HeapDumpOnOutOfMemoryError",
					"-XX:+ExitOnOutOfMemoryError",
					"-XX:HeapDumpPath=/var/lib/elasticsearch",
					"-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
					"-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m",
					"-Xms7888m",
					"-Xmx7888m",
					"-XX:MaxDirectMemorySize=4135583744",
					"-XX:G1HeapRegionSize=4m",
					"-XX:InitiatingHeapOccupancyPercent=30",
					"-XX:G1ReservePercent=15",
					"-Des.distribution.type=deb",
					"--module-path=/usr/share/elasticsearch/lib",
					"--add-modules=jdk.net",
					"-Djdk.module.main=org.elasticsearch.server"
				]
			}
		},
		"f0d3vgvJRZebyDKn68q-Pg": {
			"name": "x-elastic3.company.internal",
			"transport_address": "10.0.126.136:9300",
			"host": "10.0.126.136",
			"ip": "10.0.126.136",
			"version": "8.7.0",
			"build_flavor": "default",
			"build_type": "deb",
			"build_hash": "09520b59b6bc1057340b55750186466ea715e30e",
			"roles": [
				"master",
				"voting_only"
			],
			"attributes": {
				"aws_availability_zone": "eu-central-1c",
				"xpack.installed": "true"
			},
			"jvm": {
				"pid": 773,
				"version": "19.0.2",
				"vm_name": "OpenJDK 64-Bit Server VM",
				"vm_version": "19.0.2+7-44",
				"vm_vendor": "Oracle Corporation",
				"using_bundled_jdk": true,
				"start_time_in_millis": 1681466099046,
				"mem": {
					"heap_init_in_bytes": 973078528,
					"heap_max_in_bytes": 973078528,
					"non_heap_init_in_bytes": 7667712,
					"non_heap_max_in_bytes": 0,
					"direct_max_in_bytes": 0
				},
				"gc_collectors": [
					"G1 Young Generation",
					"G1 Old Generation"
				],
				"memory_pools": [
					"CodeHeap 'non-nmethods'",
					"Metaspace",
					"CodeHeap 'profiled nmethods'",
					"Compressed Class Space",
					"G1 Eden Space",
					"G1 Old Gen",
					"G1 Survivor Space",
					"CodeHeap 'non-profiled nmethods'"
				],
				"using_compressed_ordinary_object_pointers": "true",
				"input_arguments": [
					"-Des.networkaddress.cache.ttl=60",
					"-Des.networkaddress.cache.negative.ttl=10",
					"-Djava.security.manager=allow",
					"-XX:+AlwaysPreTouch",
					"-Xss1m",
					"-Djava.awt.headless=true",
					"-Dfile.encoding=UTF-8",
					"-Djna.nosys=true",
					"-XX:-OmitStackTraceInFastThrow",
					"-Dio.netty.noUnsafe=true",
					"-Dio.netty.noKeySetOptimization=true",
					"-Dio.netty.recycler.maxCapacityPerThread=0",
					"-Dlog4j.shutdownHookEnabled=false",
					"-Dlog4j2.disable.jmx=true",
					"-Dlog4j2.formatMsgNoLookups=true",
					"-Djava.locale.providers=SPI,COMPAT",
					"--add-opens=java.base/java.io=ALL-UNNAMED",
					"-XX:+UseG1GC",
					"-Djava.io.tmpdir=/tmp/elasticsearch-8889562591392273921",
					"-XX:+HeapDumpOnOutOfMemoryError",
					"-XX:+ExitOnOutOfMemoryError",
					"-XX:HeapDumpPath=/var/lib/elasticsearch",
					"-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
					"-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m",
					"-Xms925m",
					"-Xmx925m",
					"-XX:MaxDirectMemorySize=485490688",
					"-XX:G1HeapRegionSize=4m",
					"-XX:InitiatingHeapOccupancyPercent=30",
					"-XX:G1ReservePercent=15",
					"-Des.distribution.type=deb",
					"--module-path=/usr/share/elasticsearch/lib",
					"--add-modules=jdk.net",
					"-Djdk.module.main=org.elasticsearch.server"
				]
			}
		}
	}
}
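
As a reading aid for the output above, here is a minimal sketch that turns the voting-only node's JVM flags into a rough memory budget. It only adds up the configured upper limits for heap and direct memory; other native memory (thread stacks, metaspace, code cache and so on) is not counted, and the 2 GiB figure is the nominal size of a t4g.small rather than what the kernel actually reports. The helper names are made up for this example.

```python
import re

MIB = 1024 * 1024
UNITS = {"k": 1024, "m": MIB, "g": 1024 * MIB}


def parse_size(text: str) -> int:
    """Convert a JVM size such as '925m' or '485490688' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit.lower(), 1)


def jvm_budget(input_arguments):
    """Return (max_heap_bytes, max_direct_bytes) from JVM input arguments."""
    heap = direct = None
    for arg in input_arguments:
        if arg.startswith("-Xmx"):
            heap = parse_size(arg[len("-Xmx"):])
        elif arg.startswith("-XX:MaxDirectMemorySize="):
            direct = parse_size(arg.split("=", 1)[1])
    return heap, direct


# Flags copied from the voting-only node (x-elastic3) in the output above.
voting_only_args = ["-Xms925m", "-Xmx925m", "-XX:MaxDirectMemorySize=485490688"]
heap, direct = jvm_budget(voting_only_args)
nominal_ram = 2 * 1024 * MIB  # assumption: nominal 2 GiB of a t4g.small

print(f"heap limit:    {heap // MIB} MiB")
print(f"direct limit:  {direct // MIB} MiB")
print(f"heap + direct: {(heap + direct) // MIB} MiB of {nominal_ram // MIB} MiB nominal RAM")
```

Note that heap_max_in_bytes for this node (973078528 bytes, i.e. 928 MiB) is slightly larger than -Xmx925m; this is presumably the JVM aligning the heap size, so the sketch should be read as an approximation.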

OOM log:

Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415019] amazon-cloudwat invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415029] CPU: 1 PID: 3693 Comm: amazon-cloudwat Not tainted 5.15.0-1031-aws #35-Ubuntu
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415032] Hardware name: Amazon EC2 t4g.small/, BIOS 1.0 11/1/2018
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415034] Call trace:
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415034]  dump_backtrace+0x0/0x1ec
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415039]  show_stack+0x20/0x30
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415041]  dump_stack_lvl+0x68/0x84
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415045]  dump_stack+0x18/0x34
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415047]  dump_header+0x54/0x218
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415050]  oom_kill_process+0x228/0x230
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415054]  out_of_memory+0xe4/0x350
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415056]  __alloc_pages_may_oom+0x118/0x194
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415059]  __alloc_pages_slowpath.constprop.0+0x4c4/0x7c4
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415060]  __alloc_pages+0x2a4/0x30c
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415062]  alloc_pages+0xb4/0x1bc
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415065]  __page_cache_alloc+0xd4/0xe4
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415067]  pagecache_get_page+0x144/0x480
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415069]  filemap_fault+0x458/0x650
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415071]  __do_fault+0x44/0x1c0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415074]  do_read_fault+0xe4/0x1b0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415075]  do_fault+0xa8/0x210
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415077]  handle_pte_fault+0x5c/0x21c
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415079]  __handle_mm_fault+0x1e0/0x380
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415081]  handle_mm_fault+0xd0/0x240
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415083]  do_page_fault+0x178/0x520
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415085]  do_translation_fault+0x98/0xe0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415087]  do_mem_abort+0x48/0xc0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415089]  el0_da+0x5c/0x170
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415091]  el0t_64_sync_handler+0xe8/0x130
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415093]  el0t_64_sync+0x1a4/0x1a8
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415095] Mem-Info:
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096] active_anon:572 inactive_anon:91882 isolated_anon:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  active_file:113 inactive_file:371 isolated_file:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  unevictable:345822 dirty:0 writeback:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  slab_reclaimable:5763 slab_unreclaimable:7812
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  mapped:42261 shmem:241 pagetables:1545 bounce:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  kernel_misc_reclaimable:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415096]  free:13052 free_pcp:898 free_cma:0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415101] Node 0 active_anon:2288kB inactive_anon:367528kB active_file:452kB inactive_file:1484kB unevictable:1383288kB isolated(anon):0kB isolated(file):0kB mapped:169044kB dirty:0kB writeback:0kB shmem:964kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3904kB pagetables:6180kB all_unreclaimable? no
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415105] Node 0 DMA free:28604kB min:26360kB low:31924kB high:37488kB reserved_highatomic:0KB active_anon:32kB inactive_anon:98048kB active_file:48kB inactive_file:1196kB unevictable:799048kB writepending:0kB present:1048576kB managed:941636kB mlocked:799048kB bounce:0kB free_pcp:1608kB local_pcp:856kB free_cma:0kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415110] lowmem_reserve[]: 0 0 932 932 932
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415114] Node 0 Normal free:23604kB min:24836kB low:30532kB high:36228kB reserved_highatomic:0KB active_anon:2256kB inactive_anon:269140kB active_file:336kB inactive_file:544kB unevictable:584240kB writepending:0kB present:995328kB managed:954400kB mlocked:584240kB bounce:0kB free_pcp:1984kB local_pcp:1544kB free_cma:0kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415119] lowmem_reserve[]: 0 0 0 0 0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415122] Node 0 DMA: 233*4kB (UME) 210*8kB (UME) 315*16kB (UME) 195*32kB (UME) 106*64kB (UME) 46*128kB (UME) 8*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 28612kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415136] Node 0 Normal: 862*4kB (UME) 316*8kB (UME) 299*16kB (UME) 167*32kB (UME) 76*64kB (UME) 22*128kB (UME) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 24040kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415150] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415152] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415154] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415155] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415156] 42937 total pagecache pages
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415157] 0 pages in swap cache
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415158] Swap cache stats: add 0, delete 0, find 0/0
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415159] Free swap  = 0kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415160] Total swap = 0kB
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415161] 510976 pages RAM
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415162] 0 pages HighMem/MovableOnly
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415162] 36967 pages reserved
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415163] 0 pages hwpoisoned
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415164] Tasks state (memory values in pages):
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415164] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415167] [    197]     0   197     9977      692   102400        0          -250 systemd-journal
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415170] [    239]     0   239    72416     6417   110592        0         -1000 multipathd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415173] [    242]     0   242     2726      942    57344        0         -1000 systemd-udevd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415175] [    452]   100   452     4108      732    69632        0             0 systemd-network
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415178] [    454]   101   454     6241     1541    86016        0             0 systemd-resolve
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415180] [    549]     0   549     1728      541    49152        0             0 cron
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415182] [    550]   102   550     2242      812    57344        0          -900 dbus-daemon
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415185] [    558]     0   558    20519      349    57344        0             0 irqbalance
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415187] [    559]     0   559     8250     2843   110592        0             0 networkd-dispat
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415189] [    561]   104   561    55505      812    81920        0             0 rsyslogd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415191] [    569]     0   569   236914     4239   258048        0          -900 snapd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415193] [    570]     0   570     3801      341    73728        0             0 systemd-logind
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415195] [    580]   114   580     4652      457    57344        0             0 chronyd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415197] [    592]   114   592     2555      130    57344        0             0 chronyd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415199] [    617]     0   617     1409      174    45056        0             0 agetty
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415201] [    626]     0   626     1398      172    45056        0             0 agetty
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415203] [    654]     0   654    27486     2698   118784        0             0 unattended-upgr
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415205] [    659]     0   659    58835      944    90112        0             0 polkitd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415207] [    825]     0   825     3789      961    65536        0         -1000 sshd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415209] [   1711]     0  1711    74387     1414   167936        0             0 packagekitd
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415212] [   2484]     0  2484   326527     1823   180224        0             0 amazon-ssm-agen
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415214] [   2924]     0  2924   328774     2732   200704        0             0 ssm-agent-worke
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415216] [   3355]   115  3355   654058    24772   454656        0             0 java
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415218] [   3414]   115  3414   919484   375589  3207168        0             0 java
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415220] [   3435]   115  3435    25924      530    77824        0             0 controller
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415222] [   3505]     0  3505    20278       89    49152        0             0 gpg-agent
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415224] [   3646]     0  3646   199045     4560   282624        0             0 amazon-cloudwat
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415226] [   3744]     0  3744    10105     4016   126976        0             0 python3
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415228] [   3753]     0  3753    14334    10561   143360        0             0 apt-esm-hook
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415230] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=amazon-cloudwatch-agent.service,mems_allowed=0,global_oom,task_memcg=/system.slice/elasticsearch.service,task=java,pid=3414,uid=115
Apr 12 09:17:40 ip-10-0-103-97 kernel: [ 1972.415287] Out of memory: Killed process 3414 (java) total-vm:3677936kB, anon-rss:1338212kB, file-rss:164144kB, shmem-rss:0kB, UID:115 pgtables:3132kB oom_score_adj:0
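
The page counts in the report above can be cross-checked with a little arithmetic. The sketch below only restates numbers from the log and is not a diagnosis; it assumes 4 kB pages (which is consistent with the kB figures the kernel prints), and the helper names are made up for this example.

```python
PAGE_KB = 4  # assumption: 4 kB pages, which matches the kB figures in the log


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM report to kB."""
    return pages * PAGE_KB


def zone_state(free_kb: int, min_kb: int, low_kb: int) -> str:
    """Classify a zone's free memory against its min/low watermarks."""
    if free_kb < min_kb:
        return "below min"
    if free_kb < low_kb:
        return "between min and low"
    return "above low"


# All numbers below are copied from the report above.
print("RAM total:      ", pages_to_kb(510976), "kB")
print("managed:        ", pages_to_kb(510976 - 36967), "kB")
print("unevictable:    ", pages_to_kb(345822), "kB")
print("free:           ", pages_to_kb(13052), "kB")
print("java (3414) rss:", pages_to_kb(375589), "kB")

# (free, min, low) in kB, from the "Node 0 DMA" and "Node 0 Normal" lines.
zones = {"DMA": (28604, 26360, 31924), "Normal": (23604, 24836, 30532)}
for name, (free_kb, min_kb, low_kb) in zones.items():
    print(f"zone {name}: free={free_kb} kB -> {zone_state(free_kb, min_kb, low_kb)}")
```

With these inputs the unevictable figure comes to 1383288 kB, which matches the Node 0 line and also equals the sum of the two zones' mlocked values (799048 kB + 584240 kB). The java RSS comes to 1502356 kB, which equals anon-rss plus file-rss on the "Killed process" line.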

I'm baffled. It looks like you have plenty of free memory, and plenty more that could be freed up if needed. Elasticsearch's usage is in line with what I'd expect from its config too, but then none of the other processes are using much. Is any of this running in containers (or cgroups more generally)?

No, we don't use any containers in this Ubuntu installation; we don't even install Docker.

We also don't use any cgroups, at least not explicitly. We install just a few necessary packages with apt:
apt-transport-https gnutls-bin jq moreutils nvme-cli unzip and aws-cli, on top of a brand-new Ubuntu installation (from the AMI provided by AWS).

It may be worth mentioning that in the ES8 setup we removed the following setting, because
this (commented-out) variable is no longer present in the /etc/default/elasticsearch system configuration file (I am not sure whether removing it is correct):

MAX_LOCKED_MEMORY=unlimited

But we still use (in elasticsearch.yml):
bootstrap.memory_lock: true

And (in /etc/security/limits.d/99-elasticsearch.conf)

elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

And (in /etc/systemd/system/elasticsearch.service.d/override.conf)

[Service]
LimitMEMLOCK=infinity
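
To see whether the settings above actually result in locked memory, one option is to look at the VmLck and VmRSS fields of /proc/<pid>/status for the Elasticsearch process. The sketch below parses that file format; the function names are made up for this example, and the sample text is made-up illustrative input, not output from this cluster.

```python
def parse_status_kb(status_text: str, fields=("VmRSS", "VmLck")) -> dict:
    """Extract selected kB-valued fields from /proc/<pid>/status content."""
    values = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields and rest.split():
            values[key] = int(rest.split()[0])  # these fields are reported in kB
    return values


def read_status_kb(pid) -> dict:
    """Read and parse /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status_kb(handle.read())


# Made-up sample in the /proc/<pid>/status format, for illustration only.
sample = "Name:\tjava\nVmLck:\t    1024 kB\nVmRSS:\t    2048 kB\nThreads:\t10\n"
print(parse_status_kb(sample))
```

On the node itself, read_status_kb would be called with the pid of the Elasticsearch JVM. Elasticsearch also reports whether the lock succeeded via GET _nodes?filter_path=**.mlockall.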

The memlock changes kinda sound related but I don't really see how because you have no swap so none of the locked pages can be swapped out anyway. That said, I would recommend removing bootstrap.memory_lock, you don't need it and maybe it has other effects on Linux's memory management subsystem.

If you run a 7.17 cluster with the same memlock configuration, does it also start to suffer OOM kills?

Thanks David, I'll give it a try
