Elasticsearch 5.2.2 - Out of memory - while indexing

tomekit · March 14, 2017, 7:22am

I am running a cluster of 3 Elasticsearch 5.2.2 nodes.
Each node is a t2.medium EC2 instance (4GB memory, 2 vCPU), and each runs in a different geo-located DC: EU, AP, US (not a recommended setup). They're connected to each other using SSH tunnels.

The problem I am experiencing is that the AP and US nodes eventually die a few hours after indexing starts. They don't usually die at the same time.

The EU node doesn't die. All write requests go to the EU node, which might be why.
Also, we use "async" write requests for indexing. I'm not sure whether that could be the cause, if the "remote nodes" can't catch up for some reason.

From the logs, I understand that the ES instances were most likely killed by the OOM killer.

My heap size is set to:

-Xms2g -Xmx2g

Memory is not locked at bootstrap (bootstrap.memory_lock), as we don't have a swap partition anyway.

Any ideas on what to look for?

I know that the remote DC setup is not ideal, but that's what we have been doing for years (using ES 1.4.1). I plan to change the architecture, but until then I wanted to upgrade ES, as 1.4.1 is simply old.

ES AP node Log:

[2017-03-13T12:52:11,336][INFO ][o.e.p.PluginsService ] [v2-es-ap] no plugins loaded
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] initialized
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] starting ...
[2017-03-13T12:52:14,099][WARN ][i.n.u.i.MacAddressUtil ] Failed to find a usable hardware address from the network interfaces; using random bytes: 54:bc:a3:8e:56:f6:8a:87
[2017-03-13T12:52:14,158][INFO ][o.e.t.TransportService ] [v2-es-ap] publish_address {127.0.0.1:9303}, bound_addresses {10.10.0.27:9303}, {[::1]:9303}, {127.0.0.1:9303}, {[fe80::446:7aff:fe86:c29]:9303}
[2017-03-13T12:52:14,162][INFO ][o.e.b.BootstrapChecks ] [v2-es-ap] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-03-13T12:52:16,200][INFO ][o.e.c.s.ClusterService ] [v2-es-ap] detected_master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301}, added {{v2-es-us}{iimGscvFTyiGA8m50QPrFQ}{tntZTWWNQyaY_v7QI1e18w}{localhost}{127.0.0.1:9302},{v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301},}, reason: zen-disco-receive(from master [master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301} committed version [211]])
[2017-03-13T12:52:16,601][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][4]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][2]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][1]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,607][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][3]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,608][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][0]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,668][INFO ][o.e.h.HttpServer ] [v2-es-ap] publish_address {127.0.0.1:9200}, bound_addresses {10.10.0.27:9200}, {[::1]:9200}, {127.0.0.1:9200}, {[fe80::446:7aff:fe86:c29]:9200}
[2017-03-13T12:52:16,668][INFO ][o.e.n.Node ] [v2-es-ap] started

ES AP node service status:

elasticsearch.service - Elasticsearch
Active: failed (Result: signal) since Mon 2017-03-13 13:51:25 UTC; 16h ago
Process: 21175 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 21181 (code=killed, signal=KILL)

Mar 13 12:52:07 v2-es5-ap systemd[1]: Starting Elasticsearch...
Mar 13 12:52:07 v2-es5-ap systemd[1]: Started Elasticsearch.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Main process exited, code=killed, status=9/KILL
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Unit entered failed state.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Failed with result 'signal'.

ES US node service status:

Active: failed (Result: signal) since Mon 2017-03-13 22:05:26 UTC; 9h ago
Process: 6803 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=killed, signal=KILL)
Process: 6798 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 6803 (code=killed, signal=KILL)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

tomekit · March 15, 2017, 8:33am

I've managed to find some output from journalctl, which clearly says:

Mar 15 08:03:11 v2-es5-us kernel: Out of memory (oom_kill_allocating_task): Kill process 17552 (java) score 0 or sacrifice child
Mar 15 08:03:11 v2-es5-us kernel: Killed process 17482 (java) total-vm:38214392kB, anon-rss:2420632kB, file-rss:492624kB
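For a rough sense of scale, the kernel's figures can be converted to GiB. The sketch below is only illustrative: the helper names `parse_oom_kill` and `kb_to_gib` and the regex are assumptions, not part of any Elasticsearch or kernel tooling, and the pattern handles only the "Killed process" line format quoted above.

```python
import re

# Hypothetical pattern: matches only the "Killed process" line format quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)

def parse_oom_kill(line):
    """Return pid, process name and memory figures (in kB), or None if no match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("pid", "total_vm", "anon_rss", "file_rss"):
        info[key] = int(info[key])
    return info

def kb_to_gib(kb):
    """Convert the kernel's kB figures (1024-byte units) to GiB."""
    return kb / (1024 * 1024)

line = ("Mar 15 08:03:11 v2-es5-us kernel: Killed process 17482 (java) "
        "total-vm:38214392kB, anon-rss:2420632kB, file-rss:492624kB")
info = parse_oom_kill(line)
resident_kb = info["anon_rss"] + info["file_rss"]

print(f"{info['name']} pid={info['pid']}")
print(f"anon-rss: {kb_to_gib(info['anon_rss']):.2f} GiB")
print(f"resident (anon + file): {kb_to_gib(resident_kb):.2f} GiB")
```

For this line, anonymous resident memory comes to about 2.31 GiB, a little above the 2g heap, and total resident memory to about 2.78 GiB on a 4GB node.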


tomekit · March 16, 2017, 8:54am

I've installed monitoring to see whether there is any RAM resource issue, but I can't see any.
You can clearly see the moment when ES died and freed some memory, but before it died there was plenty left.

Also, this time it was not the Java process that was killed, but update-motd-fsc:

Mar 16 08:05:45 v2-es5-ap kernel: Out of memory (oom_kill_allocating_task): Kill process 27308 (update-motd-fsc) score 0 or sacrifice child
Mar 16 08:05:45 v2-es5-ap kernel: Killed process 27309 (update-motd-fsc) total-vm:4508kB, anon-rss:124kB, file-rss:0kB

Any ideas what this might be?

dadoonet · March 16, 2017, 9:16am

Disable OOM Killer.

See "Out of memory: Kill process".

rcowart · March 16, 2017, 9:26am

I was recently experiencing the same issue. I was running on CentOS 7 with kernel-3.10.0-514.6.2.el7.x86_64. Upgrading to kernel-3.10.0-514.10.2.el7.x86_64 solved the issue.

I got the problem kernel at the same time as 5.2.2 because I had simply updated the whole box using yum update. This caused me to suspect it was an Elasticsearch issue. However, since the kernel upgrade a few days later everything has been rock solid, with memory usage as would be expected.

Might be worth looking into.

Rob

tomekit · March 16, 2017, 12:15pm

I am running Ubuntu 16.04 with kernel 4.4.0-66-generic.
I've temporarily changed the RAM on these nodes from 4GB to 16GB and the problem has gone away, but I can't afford to run 16GB nodes, and I believe 4GB should be sufficient, as there is a production cluster on ES 1.4.2 running on 4GB nodes with Ubuntu 14.04 without any issues.

I had vm.oom_kill_allocating_task set to 1 on the dev system.

When I set vm.overcommit_memory = 2, I couldn't start ES:

Mar 16 10:44:45 v2-es5-us systemd[1]: Started Elasticsearch.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000008a660000, 1973026816, 0) failed; erro
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: #
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # There is insufficient memory for the Java Runtime Environment to continue.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # Native memory allocation (mmap) failed to map 1973026816 bytes for committing reserved memory.
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # An error report file with more information is saved as:
Mar 16 10:44:45 v2-es5-us elasticsearch[30255]: # /tmp/hs_err_pid30255.log
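One possible reading of this failure: with vm.overcommit_memory = 2 the kernel refuses allocations beyond a commit limit of roughly swap plus RAM × vm.overcommit_ratio / 100. The sketch below works through that arithmetic. It assumes the default vm.overcommit_ratio of 50 and exactly 4 GiB of RAM with no swap. The ratio actually in effect on these nodes is not stated here, and the function name is illustrative.

```python
def commit_limit_bytes(ram_bytes, swap_bytes=0, overcommit_ratio=50):
    """Approximate CommitLimit in strict overcommit mode (vm.overcommit_memory = 2).

    Simplified: ignores huge pages and vm.overcommit_kbytes.
    """
    return swap_bytes + ram_bytes * overcommit_ratio // 100

GIB = 1024 ** 3
ram = 4 * GIB             # assumption: nominal memory of the 4GB node
requested = 1973026816    # bytes the JVM tried to commit, taken from the log above

limit = commit_limit_bytes(ram)
print(f"commit limit: {limit / GIB:.2f} GiB")
print(f"JVM request:  {requested / GIB:.2f} GiB")
print(f"headroom for everything else: {(limit - requested) / GIB:.2f} GiB")
```

Under these assumptions the limit is 2 GiB, so the single 1.84 GiB request would leave only about 0.16 GiB of commit headroom for every other allocation on the machine, which would be consistent with the mmap failure.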

I've set vm.oom_kill_allocating_task back to the default of 0, and indexing seems to have been stable for an hour already. Sounds like a win, but I am keeping an eye on it.
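To confirm which values are actually in effect on each node, the two settings discussed in this thread can be read from /proc/sys. A minimal sketch, assuming a Linux host; the function name `read_sysctl` and its `base` parameter (there only so the code can be pointed at a test directory) are illustrative.

```python
from pathlib import Path

# Settings mentioned in this thread; sysctl names map to paths under /proc/sys.
SETTINGS = ("vm.oom_kill_allocating_task", "vm.overcommit_memory")

def read_sysctl(name, base="/proc/sys"):
    """Return the value of a sysctl as a string, or None if it cannot be read."""
    path = Path(base).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    for name in SETTINGS:
        print(f"{name} = {read_sysctl(name)}")
```

Running this on all three nodes would show whether the dev-only value of vm.oom_kill_allocating_task is still set anywhere.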

