Elasticsearch Process getting killed

Recently we have started seeing that our Elasticsearch process is getting killed. We are using version 5.5.1 of Elasticsearch.

Any idea what could be the cause?

● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
       └─override.conf
   Active: failed (Result: signal) since Tue 2019-10-29 16:25:50 IST; 3h 31min ago
 Docs: http://www.elastic.co
  Process: 18823 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=killed, signal=KILL)
  Process: 18820 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
 Main PID: 18823 (code=killed, signal=KILL)

Below are the kernel logs:

Hi @akshaymaniyar

See these two lines from the logs you pasted:

Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103329.931375] Out of memory: Kill process 18823 (java) score 830 or sacrifice child
Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103329.932556] Killed process 18823 (java) total-vm:241920696kB, anon-rss:18411644kB, file-rss:26660112kB, shmem-rss:0kB

The system killed the ES process because it was using too much memory. You'll have to either reduce the amount of memory used by ES (by setting a smaller heap size via jvm.options), free up memory on the system by other means, or fix your system's settings to make the OOM killer less trigger-happy (the latter is relatively unlikely to be the cause unless you have changed those settings yourself).
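
For illustration, the heap size is controlled by the -Xms/-Xmx lines in /etc/elasticsearch/jvm.options (the path may differ depending on how ES was installed); the 8g value below is just an example, not a recommendation for your workload:

# /etc/elasticsearch/jvm.options -- set both to the same value
-Xms8g
-Xmx8g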

We are running Elasticsearch on a machine which has 52 GB of RAM. We are only running Elasticsearch on this machine and have allotted 16 GB of heap space.

Below is the elasticsearch process:

elastic+  5727  179 77.2 247020628 42213928 ?  SLsl Oct29 4075:47 /usr/bin/java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch
elastic+  5845  0.0  0.0 131304  7988 ?        Sl   Oct29   0:00 /usr/share/elasticsearch/plugins/x-pack/platform/linux-x86_64/bin/controller

Below is the usual memory usage of the machine

               total        used        free      shared  buff/cache   available
Mem:             52          18           1           0          32          33
Swap:             0           0           0

Are we missing anything?

From the looks of it, this would imply an issue with your system settings.

What are your system's settings for vm.overcommit_memory and vm.swappiness?

Below are the values:

cat /proc/sys/vm/overcommit_memory
0
cat /proc/sys/vm/swappiness
60

What are the recommended values for a machine running Elasticsearch?

As far as I know we do not make any recommendation on overcommit_memory, since the correct setting here is somewhat use-case- and system-specific.

We do however recommend turning swapping off, or at least down significantly (see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/setup-configuration-memory.html). I would recommend moving to a swappiness of 1 as per the linked docs, as a value of 60 is likely detrimental to the performance of your ES process because of the risk of random swapping.
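
If you go that route, this is a minimal way to apply the setting immediately and persist it across reboots (standard sysctl usage, nothing ES-specific; the file name is just an example):

sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' >> /etc/sysctl.d/99-elasticsearch.conf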

This is, however, not necessarily going to fix the OOM killer killing the ES process. Given that your system has significantly more memory available than is outright needed to run ES with a 16 GB heap, you could try fixing the issue by allowing the system to allocate more aggressively via

echo 1 > /proc/sys/vm/overcommit_memory
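
If you want that to survive a reboot, you could also persist it via a sysctl configuration file (again, the file name below is only an example):

echo 'vm.overcommit_memory=1' >> /etc/sysctl.d/99-overcommit.conf
sysctl -p /etc/sysctl.d/99-overcommit.conf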

EDIT:

One other thing you should look into is this line:

Oct 29 16:25:48 cms-zulu-datastore-none-1551817 kernel: [103314.853172] INFO: task java:29057 blocked for more than 120 seconds.

It seems your java process became completely blocked on "something" here. That "something" is most likely disk IO. Could there be a problem with that such that disk IO resources are temporarily almost completely exhausted? What kind of storage hardware do you have backing your cluster?
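
If you want to check that, something like iostat from the sysstat package (assuming it is installed) will show per-device utilization and wait times while the node is under load, e.g. extended stats every second for 10 samples:

iostat -x 1 10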

Regarding swappiness, we already have the setting below in elasticsearch.yml:
bootstrap.memory_lock: true

Do you recommend doing both, i.e. changing swappiness to 1 and keeping bootstrap.memory_lock: true?

Will get back on the IO metrics and the kind of hardware being used.

Yes. Unrelated to this problem, it's still a good idea to turn down swappiness even with mlock in place, as mlock does not extend to ES's off-heap memory use.
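
As a side note, you can confirm that the memory lock actually took effect (it can silently fail if the memlock ulimit is too low) by querying the nodes API; localhost:9200 is just the default address and may differ in your setup:

curl 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'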
