Hello!
Elasticsearch version: 5.2.2
JVM version: 1.8.0_144-b01
OS version: Ubuntu 16.04.1 LTS
Kernel version: 4.10.0-32-generic #36~16.04.1-Ubuntu
RAM: 64GB
JVM heap min and max settings: 31GB
The Elasticsearch cluster includes 14 data nodes, each with 64GB RAM (JVM heap min and max set to 31GB); the total data size is 3.86TB. The nodes differ in JVM version and OS kernel version. When we started a snapshot, the elasticsearch service began to fail on some nodes. In syslog I found the following messages:
Mar 4 09:20:28 puma kernel: [48037073.162924] java invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
Mar 4 09:20:28 puma kernel: [48037073.162924] java cpuset=/ mems_allowed=0
Mar 4 09:20:28 puma kernel: [48037073.162927] CPU: 1 PID: 18234 Comm: java Not tainted 4.10.0-32-generic #36~16.04.1-Ubuntu
...
Mar 4 09:20:28 puma kernel: [48037073.162960] Mem-Info:
Mar 4 09:20:28 puma kernel: [48037073.162962] active_anon:192090 inactive_anon:69654 isolated_anon:0
Mar 4 09:20:28 puma kernel: [48037073.162962] active_file:6988872 inactive_file:558679 isolated_file:128
Mar 4 09:20:28 puma kernel: [48037073.162962] unevictable:8337921 dirty:558048 writeback:737 unstable:0
Mar 4 09:20:28 puma kernel: [48037073.162962] slab_reclaimable:118822 slab_unreclaimable:9965
Mar 4 09:20:28 puma kernel: [48037073.162962] mapped:4169034 shmem:165032 pagetables:37529 bounce:0
Mar 4 09:20:28 puma kernel: [48037073.162962] free:81835 free_pcp:87 free_cma:0
Mar 4 09:20:28 puma kernel: [48037073.162964] Node 0 active_anon:768360kB inactive_anon:278616kB active_file:27955488kB inactive_file:2234716kB unevictable:33351684kB isolated(anon):0kB isolated(file):512kB mapped:16676136kB dirty:2232192kB writeback:2948kB shmem:660128kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 33370112kB writeback_tmp:0kB unstable:0kB pages_scanned:45789873 all_unreclaimable? yes
Mar 4 09:20:28 puma kernel: [48037073.162964] Node 0 DMA free:15900kB min:16kB low:28kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 4 09:20:28 puma kernel: [48037073.162966] lowmem_reserve[]: 0 3009 64147 64147 64147
Mar 4 09:20:28 puma kernel: [48037073.162967] Node 0 DMA32 free:247712kB min:3168kB low:6248kB high:9328kB active_anon:32968kB inactive_anon:7368kB active_file:2220680kB inactive_file:450976kB unevictable:1512kB writepending:450976kB present:3185620kB managed:3120052kB mlocked:1512kB slab_reclaimable:127112kB slab_unreclaimable:3480kB kernel_stack:48kB pagetables:27236kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 4 09:20:28 puma kernel: [48037073.162970] lowmem_reserve[]: 0 0 61138 61138 61138
Mar 4 09:20:28 puma kernel: [48037073.162971] Node 0 Normal free:63728kB min:64396kB low:127000kB high:189604kB active_anon:735392kB inactive_anon:271248kB active_file:25734808kB inactive_file:1783740kB unevictable:33350172kB writepending:1784164kB present:63676416kB managed:62609048kB mlocked:33350172kB slab_reclaimable:348176kB slab_unreclaimable:36380kB kernel_stack:4336kB pagetables:122880kB bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
Mar 4 09:20:28 puma kernel: [48037073.162973] lowmem_reserve[]: 0 0 0 0 0
Mar 4 09:20:28 puma kernel: [48037073.162974] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
Mar 4 09:20:28 puma kernel: [48037073.162979] Node 0 DMA32: 320*4kB (UME) 386*8kB (UME) 1481*16kB (UME) 808*32kB (UME) 184*64kB (UME) 34*128kB (UE) 14*256kB (ME) 8*512kB (UME) 12*1024kB (U) 1*2048kB (M) 38*4096kB (UM) = 247712kB
Mar 4 09:20:28 puma kernel: [48037073.162984] Node 0 Normal: 3630*4kB (UME) 2703*8kB (ME) 1566*16kB (UME) 79*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63728kB
Mar 4 09:20:28 puma kernel: [48037073.162988] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 4 09:20:28 puma kernel: [48037073.162989] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 4 09:20:28 puma kernel: [48037073.162989] 7719314 total pagecache pages
Mar 4 09:20:28 puma kernel: [48037073.162990] 0 pages in swap cache
Mar 4 09:20:28 puma kernel: [48037073.162990] Swap cache stats: add 0, delete 0, find 0/0
Mar 4 09:20:28 puma kernel: [48037073.162990] Free swap = 0kB
Mar 4 09:20:28 puma kernel: [48037073.162991] Total swap = 0kB
Mar 4 09:20:28 puma kernel: [48037073.162991] 16719505 pages RAM
Mar 4 09:20:28 puma kernel: [48037073.162991] 0 pages HighMem/MovableOnly
Mar 4 09:20:28 puma kernel: [48037073.162991] 283255 pages reserved
Mar 4 09:20:28 puma kernel: [48037073.162992] 0 pages cma reserved
Mar 4 09:20:28 puma kernel: [48037073.162992] 0 pages hwpoisoned
...
Mar 4 09:20:28 puma kernel: [48037073.163014] Out of memory: Kill process 18098 (java) score 768 or sacrifice child
Mar 4 09:20:28 puma kernel: [48037073.163176] Killed process 18098 (java) total-vm:214719256kB, anon-rss:33687608kB, file-rss:16705264kB, shmem-rss:0kB
Mar 4 09:20:28 puma kernel: [48037073.750186] oom_reaper: reaped process 18098 (java), now anon-rss:33325188kB, file-rss:21732kB, shmem-rss:0kB
Mar 4 09:24:40 puma systemd[1]: elasticsearch.service: Main process exited, code=killed, status=9/KILL
Mar 4 09:24:40 puma systemd[1]: elasticsearch.service: Unit entered failed state.
Mar 4 09:24:40 puma systemd[1]: elasticsearch.service: Failed with result 'signal'.
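As a rough sanity check on the dump above, the page counts can be converted to GiB and compared with the RSS of the killed process. This is only a sketch: it assumes 4 kB pages (consistent with the log, where unevictable:8337921 pages is reported as 33351684kB), and the helper names `pages_to_gib` and `parse_rss_kb` are illustrative, not part of any tool.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, consistent with the kB figures in the log

# Page counts copied from the Mem-Info block above (units: pages).
mem_info_pages = {
    "active_file": 6988872,
    "inactive_file": 558679,
    "unevictable": 8337921,
    "free": 81835,
    "total_ram": 16719505,
}

# The "Killed process" line copied from the log above.
KILL_LINE = (
    "Killed process 18098 (java) total-vm:214719256kB, "
    "anon-rss:33687608kB, file-rss:16705264kB, shmem-rss:0kB"
)


def pages_to_gib(pages, page_kb=PAGE_KB):
    """Convert a page count to GiB."""
    return pages * page_kb / (1024 * 1024)


def parse_rss_kb(line):
    """Extract the *-rss fields (in kB) from an oom-killer 'Killed process' line."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+-rss):(\d+)kB", line)}


if __name__ == "__main__":
    # Per-counter view of the Mem-Info block.
    for name, pages in mem_info_pages.items():
        print(f"{name:>14}: {pages_to_gib(pages):6.2f} GiB")

    # Resident memory of the killed java process, split by type.
    rss = parse_rss_kb(KILL_LINE)
    for name, kb in rss.items():
        print(f"{name:>14}: {kb / (1024 * 1024):6.2f} GiB")
```

By this arithmetic the unevictable (mlocked) memory is about 31.8 GiB, which is in line with the locked 31GB heap, and most of the remaining RAM is file cache. The log also shows Node 0 Normal free:63728kB, which is below its min watermark of 64396kB at the moment the oom-killer was invoked.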
On nodes with OS kernel version 4.4.0-66-generic #87-Ubuntu the problem does not occur.
Elasticsearch configuration on the problem nodes:
cluster.name: elastic
node.master: false
node.data: true
node.ingest: false
node.name: puma
bootstrap.memory_lock: true
http.port: 9221
http.host: puma.XXX.XXX
transport.tcp.port: 9331
transport.host: puma.XXX.XXX
discovery.zen.fd.ping_interval: 1d
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.unicast.hosts: ['XXX.XXX.XXX']
### BEGIN ### Elasticsearch repository for snapshots ###
path.repo: ["/mnt/snapshot"]
### END ### Elasticsearch repository for snapshots ###
This problem is similar to https://github.com/elastic/elasticsearch/issues/22788 , but I have different versions of Elasticsearch, JVM, and OS kernel.
Is this a problem with the OS kernel, as in https://github.com/elastic/elasticsearch/issues/22788 , or possibly with the Elasticsearch configuration?