Elasticsearch Process getting killed

Recently we have started seeing that our Elasticsearch process is getting killed. We are using version 5.5.1 of Elasticsearch.

Any idea what could be the cause?

● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
       └─override.conf
   Active: failed (Result: signal) since Tue 2019-10-29 16:25:50 IST; 3h 31min ago
 Docs: http://www.elastic.co
  Process: 18823 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=killed, signal=KILL)
  Process: 18820 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
 Main PID: 18823 (code=killed, signal=KILL)

Below are the kernel logs:

Hi @akshaymaniyar

See these two lines from the logs you pasted:

Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103329.931375] Out of memory: Kill process 18823 (java) score 830 or sacrifice child
Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103329.932556] Killed process 18823 (java) total-vm:241920696kB, anon-rss:18411644kB, file-rss:26660112kB, shmem-rss:0kB

The system killed the ES process because it was using too much memory. You'll have to either reduce the amount of memory used by ES (by setting a smaller heap size via jvm.options), free up memory on the system by other means, or fix your system's settings to make the OOM killer less trigger-happy (the latter is relatively unlikely to be the cause unless you have changed those settings yourself).
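
For illustration, the heap size is controlled by the -Xms/-Xmx lines in /etc/elasticsearch/jvm.options (the path may differ depending on how ES was installed); the 8g value below is just an example, not a recommendation for your workload:

# /etc/elasticsearch/jvm.options -- set both to the same value
-Xms8g
-Xmx8g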

We are running Elasticsearch on a machine which has 52 GB of RAM. We are only running Elasticsearch on this machine and have allotted 16 GB of heap space.

Below is the elasticsearch process:

elastic+  5727  179 77.2 247020628 42213928 ?  SLsl Oct29 4075:47 /usr/bin/java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch
elastic+  5845  0.0  0.0 131304  7988 ?        Sl   Oct29   0:00 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

Below is the usual memory usage of the machine

               total        used        free      shared  buff/cache   available
Mem:             52          18           1           0          32          33
Swap:             0           0           0

Are we missing anything?

From the looks of it, this would imply an issue with your system settings.

What are your system's settings for vm.overcommit_memory and vm.swappiness?

Below are the values:

cat /proc/sys/vm/overcommit_memory
0
cat /proc/sys/vm/swappiness
60

What are the recommended values for a machine running Elasticsearch?

As far as I know we do not make any recommendation on overcommit_memory, since the correct setting here is somewhat use-case- and system-specific.

We do however recommend turning swapping off, or at least down significantly (see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/setup-configuration-memory.html). I would recommend moving to a swappiness of 1 as per the linked docs, as a value of 60 is likely detrimental to the performance of your ES process because of the risk of random swapping.
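
If you go that route, this is a minimal way to apply the setting immediately and persist it across reboots (standard sysctl usage, nothing ES-specific; the file name is just an example):

sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' >> /etc/sysctl.d/99-elasticsearch.conf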

This is, however, not necessarily going to fix the OOM killer killing the ES process. Given that your system has significantly more memory available than is outright needed to run ES with a 16 GB heap, you could try fixing the issue by allowing the system to allocate more aggressively via

echo 1 > /proc/sys/vm/overcommit_memory
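
If you want that to survive a reboot, you could also persist it via a sysctl configuration file (again, the file name below is only an example):

echo 'vm.overcommit_memory=1' >> /etc/sysctl.d/99-overcommit.conf
sysctl -p /etc/sysctl.d/99-overcommit.conf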

EDIT:

One other thing you should look into is this line:

Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103314.853172] INFO: task java:29057 blocked for more than 120 seconds.

It seems your java process became completely blocked on "something" here. That "something" is most likely disk IO. Could there be a problem with that such that disk IO resources are temporarily almost completely exhausted? What kind of storage hardware do you have backing your cluster?
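
If you want to check that, something like iostat from the sysstat package (assuming it is installed) will show per-device utilization and wait times while the node is under load, e.g. extended stats every second for 10 samples:

iostat -x 1 10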

Regarding swappiness, we already have the setting below in elasticsearch.yml:
bootstrap.memory_lock: true

Do you recommend doing both, i.e. changing swappiness to 1 and keeping bootstrap.memory_lock: true?

Will get back on the IO metrics and the kind of hardware being used.

Yes. Unrelated to this problem, it's still a good idea to turn down swappiness even with mlock in place, as mlock does not extend to ES's off-heap memory use.
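
As a side note, you can confirm that the memory lock actually took effect (it can silently fail if the memlock ulimit is too low) by querying the nodes API; localhost:9200 is just the default address and may differ in your setup:

curl 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'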
