Hi
I use a dedicated Elasticsearch cluster consisting of 9 nodes: 3 dedicated master nodes and 6 dedicated data nodes.
The cluster is used as a backend to collect logs from all of our company's production servers. Log traffic volume is about 160 GB/day, or 3000-6000 requests per second. Data node properties:
Hardware: i3.2xlarge EC2 instance (144 GB RAM, 2 x 1900 GB NVMe SSD disks in a striped LVM RAID)
Java heap size is set to 28 GB.
Elasticsearch is configured to lock memory on bootstrap, and the machines are configured not to use swap at all (no swap partition, vm.swappiness = 1); see the snippet below.
OS: Ubuntu 16.04 x64
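For reference, this is roughly how those memory settings are applied on my nodes (file paths assume the default deb package layout, and the sysctl file name is just what I chose):

# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

# /etc/elasticsearch/jvm.options
-Xms28g
-Xmx28g

# /etc/sysctl.d/90-elasticsearch.conf
vm.swappiness = 1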
I noticed that the Elasticsearch service on different nodes gets killed (probably by the OOM killer).
from kern.log:
Jul 19 02:37:59 awses-dbnode1 kernel: [753817.897047] Out of memory: Kill process 67520 (java) score 526 or sacrifice child
Jul 19 02:37:59 awses-dbnode1 kernel: [753817.901374] Killed process 67520 (java) total-vm:1290332684kB, anon-rss:30413004kB, file-rss:35344396kB
Could it be the result of the lack of swap?
Is it safe to configure systemd to restart the Elasticsearch service automatically in case of a kill/crash?
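Something like this drop-in is what I have in mind (assuming the stock elasticsearch.service from the deb package; the restart delay is arbitrary):

# /etc/systemd/system/elasticsearch.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=30

followed by sudo systemctl daemon-reload.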
It seems like the issue has been resolved by adding swap space to the machines, while still leaving vm.swappiness = 1. I see very little swap usage on all nodes, but the OOM killer is not being triggered anymore.
I don't observe any performance degradation in the Elasticsearch cluster.
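For the record, this is roughly how I added the swap space on each data node (the 8 GB size is just what I picked, not a recommendation):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# make it persistent across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab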
Nope. Yesterday one node still got killed by the OOM killer. It seems to be something to address with Java heap settings or kernel parameter tuning. I also upgraded to kernel aws-1063 today.
Maybe someone can point me to which Java/kernel settings to adjust to get rid of the OOM killer?