Elasticsearch 5.6.10 instantly killed (OOM killer)

Hi
I use a dedicated Elasticsearch cluster consisting of 9 nodes: 3 dedicated master nodes and 6 dedicated data nodes.
The cluster is used as a backend to collect logs from all company production servers. Log traffic volume is about 160 GB/day, or 3000-6000 requests per second. Data node properties:
Hardware: i3.2xlarge EC2 instance (144 GB RAM, 2 x 1900 GB NVMe SSD disks in a striped LVM RAID)
Java heap size is set to 28 GB.
Elasticsearch is configured to lock memory on bootstrap, and the machines are configured not to use swap at all (no swap partition, vm.swappiness = 1).
OS: Ubuntu 16.04 x64
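
Roughly, the relevant parts of that configuration look like this (the sysctl and systemd drop-in file names below are just examples; bootstrap.memory_lock plus LimitMEMLOCK is the standard way to make the memory lock work under systemd):

# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

# /etc/sysctl.d/70-swappiness.conf  (example file name)
vm.swappiness = 1

# /etc/systemd/system/elasticsearch.service.d/memlock.conf  (example file name)
[Service]
LimitMEMLOCK=infinity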

I noticed that the Elasticsearch service on different nodes gets killed (probably by the OOM killer).

from kern.log:
Jul 19 02:37:59 awses-dbnode1 kernel: [753817.897047] Out of memory: Kill process 67520 (java) score 526 or sacrifice child
Jul 19 02:37:59 awses-dbnode1 kernel: [753817.901374] Killed process 67520 (java) total-vm:1290332684kB, anon-rss:30413004kB, file-rss:35344396kB

Could it be a result of not having any swap at all?
Is it safe to configure systemd to restart the Elasticsearch service automatically if it gets killed or crashes?
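
For the systemd part, I mean something like this drop-in (just a sketch; the file name is arbitrary):

# /etc/systemd/system/elasticsearch.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=30

# then reload and restart:
#   sudo systemctl daemon-reload
#   sudo systemctl restart elasticsearch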

Can you run the free command on the server to see what the memory usage is? It seems like Elasticsearch can't get 28GB of memory to start up.
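
For example (the ps line is just to compare the Java process RSS against the total):

free -h
ps -C java -o pid,rss,vsz,args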

Hi.

There is no issue with starting Elasticsearch. The server has 144 GB of RAM.
The Elasticsearch process gets killed after, let's say, a week or so.

This seems like a memory leak. Can you post the JVM settings?

Sure. Here it is:
-Xms28000m
-Xmx28000m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
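
In case it helps, the heap size the JVM actually picked up and whether compressed oops are in use can also be checked via the nodes API, something like this (field names are from the 5.x nodes info response, as far as I know):

curl -s 'localhost:9200/_nodes/jvm?pretty&filter_path=nodes.*.jvm.mem.heap_max,nodes.*.jvm.using_compressed_ordinary_object_pointers'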

It seems like the issue has been resolved by adding swap space to the machines, while still leaving vm.swappiness=1. I see very little swap usage on all nodes, but the OOM killer is not triggered anymore.
I don't observe any performance degradation in the Elasticsearch cluster.
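
For reference, adding a swap file goes roughly like this (the 8G size and the /swapfile path are just examples, not a recommendation):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo sysctl vm.swappiness=1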

Nope. Yesterday one node still got killed by the OOM killer. It seems I need to play with the Java heap settings or kernel parameter tuning. I also upgraded to the aws-1063 kernel today.
Maybe someone can point me to which Java/kernel settings to adjust to get rid of the OOM killer?
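
To be concrete, the kind of knob I mean is the per-process OOM score, which can be inspected and, via systemd, lowered for the Elasticsearch unit; I'm just not sure whether doing that is safe or advisable:

# current OOM score of the Elasticsearch JVM (higher = more likely to be killed)
cat /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/oom_score

# possible systemd drop-in (file name is arbitrary)
# /etc/systemd/system/elasticsearch.service.d/oom.conf
[Service]
OOMScoreAdjust=-500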
