Elasticsearch 5.2.2 - Out of memory - while indexing

I am running a cluster of three Elasticsearch 5.2.2 nodes.
Each node is a t2.medium EC2 instance (4GB memory, 2 vCPUs), and the nodes run in different geo-located DCs: EU, AP and US (not a recommended setup). They're connected to each other using SSH tunnels.

The problem I am experiencing is that the AP and US nodes eventually die a few hours after indexing starts. They don't usually die at the same time.

The EU node doesn't die. All write requests go to the EU node, which is probably why.
We also use "async" write requests for indexing. I'm not sure whether that could be the cause of the problem if the "remote nodes" can't catch up for some reason.

From the logs, I understand that the ES instances were most likely killed by the OOM killer.

My heap size is set to:

-Xms2g -Xmx2g
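
As a rough sanity check of this setting, here is a minimal sketch that parses the heap flags and compares them with the node's RAM. Elasticsearch's general guidance is to set Xms and Xmx to the same value and to keep the heap at no more than about half of physical RAM. The function names are hypothetical, and treating "4GB" as exactly 4 GiB is an assumption (the kernel usually reports slightly less).

```python
import re

# Multipliers for the size suffixes accepted by -Xms / -Xmx.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_heap_flags(jvm_opts: str) -> dict:
    """Return {'Xms': bytes, 'Xmx': bytes} for the heap flags found in jvm_opts."""
    sizes = {}
    for name, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", jvm_opts):
        sizes[name] = int(number) * _UNITS.get(unit.lower(), 1)
    return sizes


def heap_report(jvm_opts: str, ram_bytes: int) -> dict:
    """Summarise how the configured heap relates to the machine's RAM."""
    sizes = parse_heap_flags(jvm_opts)
    return {
        "min_equals_max": sizes.get("Xms") == sizes.get("Xmx"),
        "heap_fraction_of_ram": sizes["Xmx"] / ram_bytes,
        "bytes_left_outside_heap": ram_bytes - sizes["Xmx"],
    }


# Assumption: a 4GB node is treated as exactly 4 GiB of RAM.
report = heap_report("-Xms2g -Xmx2g", ram_bytes=4 * 1024**3)
print(report)
```

With these numbers the heap takes half of the RAM. The other half has to cover the JVM's off-heap memory, the filesystem cache and everything else running on the box.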

Memory is not locked at bootstrap, as we don't have a swap partition anyway.

Any ideas on where to look and what to look for?

I know that the remote DC setup is not ideal, but that's what we have been doing for years (using ES 1.4.1). I plan to change the architecture, but until then I wanted to upgrade ES, as 1.4.1 is simply old.

ES AP node Log:

[2017-03-13T12:52:11,336][INFO ][o.e.p.PluginsService ] [v2-es-ap] no plugins loaded
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] initialized
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] starting ...
[2017-03-13T12:52:14,099][WARN ][i.n.u.i.MacAddressUtil ] Failed to find a usable hardware address from the network interfaces; using random bytes: 54:bc:a3:8e:56:f6:8a:87
[2017-03-13T12:52:14,158][INFO ][o.e.t.TransportService ] [v2-es-ap] publish_address {127.0.0.1:9303}, bound_addresses {10.10.0.27:9303}, {[::1]:9303}, {127.0.0.1:9303}, {[fe80::446:7aff:fe86:c29]:9303}
[2017-03-13T12:52:14,162][INFO ][o.e.b.BootstrapChecks ] [v2-es-ap] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-03-13T12:52:16,200][INFO ][o.e.c.s.ClusterService ] [v2-es-ap] detected_master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301}, added {{v2-es-us}{iimGscvFTyiGA8m50QPrFQ}{tntZTWWNQyaY_v7QI1e18w}{localhost}{127.0.0.1:9302},{v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301},}, reason: zen-disco-receive(from master [master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301} committed version [211]])
[2017-03-13T12:52:16,601][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][4]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][2]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][1]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,607][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][3]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,608][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][0]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,668][INFO ][o.e.h.HttpServer ] [v2-es-ap] publish_address {127.0.0.1:9200}, bound_addresses {10.10.0.27:9200}, {[::1]:9200}, {127.0.0.1:9200}, {[fe80::446:7aff:fe86:c29]:9200}
[2017-03-13T12:52:16,668][INFO ][o.e.n.Node ] [v2-es-ap] started

ES AP node service status:

  • elasticsearch.service - Elasticsearch
    Active: failed (Result: signal) since Mon 2017-03-13 13:51:25 UTC; 16h ago
    Process: 21175 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
    Main PID: 21181 (code=killed, signal=KILL)

Mar 13 12:52:07 v2-es5-ap systemd[1]: Starting Elasticsearch...
Mar 13 12:52:07 v2-es5-ap systemd[1]: Started Elasticsearch.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Main process exited, code=killed, status=9/KILL
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Unit entered failed state.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Failed with result 'signal'.

ES US node service status:

Active: failed (Result: signal) since Mon 2017-03-13 22:05:26 UTC; 9h ago
Process: 6803 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=killed, signal=KILL)
Process: 6798 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 6803 (code=killed, signal=KILL)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I've managed to find some output from journalctl, which clearly says:

Mar 15 08:03:11 v2-es5-us kernel: Out of memory (oom_kill_allocating_task): Kill process 17552 (java) score 0 or sacrifice child
Mar 15 08:03:11 v2-es5-us kernel: Killed process 17482 (java) total-vm:38214392kB, anon-rss:2420632kB, file-rss:492624kB
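
To make these numbers easier to read, here is a minimal sketch that pulls the fields out of a kernel "Killed process" line and adds up the resident memory. The regular expression is written against the line format shown above and may not match other kernel versions. `parse_oom_kill` is a hypothetical helper name.

```python
import re

# Matches the "Killed process ..." line format shown in the journalctl output.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line: str):
    """Return a dict describing the killed process, or None if the line does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    info = {"pid": int(m["pid"]), "name": m["name"]}
    for key in ("total_vm", "anon_rss", "file_rss"):
        info[key + "_kb"] = int(m[key])
    # Resident memory = anonymous pages + file-backed pages (e.g. mmapped files).
    info["rss_kb"] = info["anon_rss_kb"] + info["file_rss_kb"]
    return info


line = (
    "Mar 15 08:03:11 v2-es5-us kernel: Killed process 17482 (java) "
    "total-vm:38214392kB, anon-rss:2420632kB, file-rss:492624kB"
)
info = parse_oom_kill(line)
print(f"{info['name']} pid={info['pid']} resident={info['rss_kb'] / 1024:.0f} MiB")
```

For the line above this gives about 2845 MiB resident. That is well above the 2g heap, because the JVM also holds off-heap memory and file-backed pages.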

Full log:
http://pastebin.com/bEMzc7AD

I've installed monitoring to see whether there is any RAM resource issue, but I can't see any.
You can clearly see the moment when ES died and freed some of the memory, but before it died there was plenty left.

Also, this time it was not the Java process that was killed, but update-motd-fsc:

Mar 16 08:05:45 v2-es5-ap kernel: Out of memory (oom_kill_allocating_task): Kill process 27308 (update-motd-fsc) score 0 or sacrifice child
Mar 16 08:05:45 v2-es5-ap kernel: Killed process 27309 (update-motd-fsc) total-vm:4508kB, anon-rss:124kB, file-rss:0kB

Any ideas what this might be?

Disable OOM Killer.

See "Out of memory: Kill process".

I was recently experiencing the same issue. I was running on CentOS 7 with kernel-3.10.0-514.6.2.el7.x86_64. Upgrading to kernel-3.10.0-514.10.2.el7.x86_64 solved the issue.

I got the problem kernel at the same time as 5.2.2 because I had simply updated the whole box using yum update. This caused me to suspect it was an Elasticsearch issue. However, since the kernel upgrade a few days later everything has been rock solid, with memory usage as would be expected.

Might be worth looking into.

Rob

I am running Ubuntu 16.04 with kernel 4.4.0-66-generic.
I've temporarily increased the RAM on these nodes from 4GB to 16GB and the problem has gone, but I can't afford to run 16GB nodes. I believe 4GB should be sufficient, as there is a production cluster on ES 1.4.2 running on 4GB nodes with Ubuntu 14.04 without any issues.

I've got vm.oom_kill_allocating_task set to 1 on the dev system.

When I set vm.overcommit_memory = 2, I couldn't start ES:

Mar 16 10:44:45 v2-es5-us systemd[1]: Started Elasticsearch.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000008a660000, 1973026816, 0) failed; erro
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: #
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # There is insufficient memory for the Java Runtime Environment to continue.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # Native memory allocation (mmap) failed to map 1973026816 bytes for committing reserved memory.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # An error report file with more information is saved as:
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # /tmp/hs_err_pid30255.log
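
A likely explanation for this failure is that with vm.overcommit_memory = 2 the kernel caps the total committed memory at swap plus a percentage of RAM, given by vm.overcommit_ratio. The sketch below works through that arithmetic. It assumes the ratio was left at its default of 50 (the value on these nodes is not stated in this thread), treats the node as exactly 4 GiB with no swap, and ignores huge pages.

```python
GIB = 1024**3


def commit_limit_bytes(ram_bytes: int, swap_bytes: int = 0, overcommit_ratio: int = 50) -> int:
    """Approximate CommitLimit under vm.overcommit_memory = 2 (huge pages ignored)."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100


# Assumptions: 4 GiB of RAM, no swap, default vm.overcommit_ratio of 50.
limit = commit_limit_bytes(ram_bytes=4 * GIB)

# Size of the single mapping the JVM failed to commit, taken from the
# os::commit_memory line in the log above.
requested = 1973026816

print(f"commit limit   ~ {limit / GIB:.2f} GiB")
print(f"JVM request    ~ {requested / GIB:.2f} GiB")
print(f"left for rest  = {limit - requested} bytes")
```

Under these assumptions the limit is 2 GiB for the whole system, and this one request alone would leave only about 166 MiB for every other process and for the rest of the JVM, so the allocation failing is not surprising.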

I've set vm.oom_kill_allocating_task back to its default of 0, and indexing has been running stably for an hour now. Sounds like a win, but I am keeping an eye on it.
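
For anyone comparing nodes, here is a minimal sketch for checking the vm settings mentioned in this thread by reading them from procfs. It assumes a Linux host where sysctl names map to paths under /proc/sys. `read_vm_settings` is a hypothetical helper name.

```python
from pathlib import Path


def read_vm_settings(names, proc_sys=Path("/proc/sys")):
    """Read sysctl values such as 'vm.overcommit_memory' from procfs."""
    values = {}
    for name in names:
        # 'vm.overcommit_memory' lives at /proc/sys/vm/overcommit_memory.
        path = Path(proc_sys) / name.replace(".", "/")
        try:
            values[name] = path.read_text().strip()
        except OSError:
            values[name] = None  # not available on this kernel or platform
    return values


SETTINGS = [
    "vm.oom_kill_allocating_task",
    "vm.overcommit_memory",
    "vm.overcommit_ratio",
]

if __name__ == "__main__":
    for name, value in read_vm_settings(SETTINGS).items():
        print(f"{name} = {value}")
```

Running it on each node makes it easy to spot a host where one of these values differs from the others.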
