I am running a cluster of 3 Elasticsearch 5.2.2 nodes.
Each node is a t2.medium EC2 instance (4 GB memory, 2 vCPUs), and they run in different geo-located DCs: EU, AP, and US (a setup I know is not recommended). They're connected to each other using SSH tunnels.
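For reference, the tunneling looks roughly like this on each box (the host names and "user" are placeholders, not our exact config); the other two nodes' transport ports get forwarded to localhost, which is why the logs below show peers at 127.0.0.1:9301/9302:

    # e.g. on the AP node:
    ssh -f -N -L 9301:localhost:9300 user@eu-node   # EU transport reachable at localhost:9301
    ssh -f -N -L 9302:localhost:9300 user@us-node   # US transport reachable at localhost:9302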
The problem I am experiencing is that the AP and US nodes eventually die, a few hours after indexing starts. They don't usually die at the same time.
The EU node doesn't die; all write requests go to the EU node, which is probably why.
We also use "async" write requests for indexing; I'm not sure whether that could be part of the problem if the remote nodes can't keep up for some reason.
From the logs I understand that the ES instances were likely killed by the OOM killer.
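To double-check that theory, the kernel log should contain the kill record, assuming it hasn't rotated away yet:

    dmesg -T | grep -iE 'out of memory|killed process'
    journalctl -k | grep -i oom    # same thing, via the systemd journal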
My heap size is set to:
-Xms2g -Xmx2g
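(That's in /etc/elasticsearch/jvm.options.) As I understand it, the JVM's total resident size is heap plus off-heap memory (Netty direct buffers, metaspace, thread stacks), so a 2g heap can still grow past what a 4GB box tolerates. One experiment I'm considering is capping direct memory; this is just a sketch, and the 512m value is a guess:

    # /etc/elasticsearch/jvm.options
    -Xms2g
    -Xmx2g
    -XX:MaxDirectMemorySize=512m    # experimental cap on off-heap direct buffers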
Memory is not locked with bootstrap.memory_lock, as we don't have a swap partition anyway.
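For completeness, if I enable it later, my understanding is that on 5.x it takes the setting in elasticsearch.yml plus a systemd override (a sketch, assuming the stock unit file):

    # /etc/elasticsearch/elasticsearch.yml
    bootstrap.memory_lock: true

    # systemctl edit elasticsearch, then add:
    [Service]
    LimitMEMLOCK=infinity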
Any ideas what to look for?
I know that the remote-DC setup is not ideal, but it's what we've been doing for years (on ES 1.4.1). I plan to change the architecture, but until then I wanted to upgrade ES, as 1.4.1 is simply old.
ES AP node Log:
[2017-03-13T12:52:11,336][INFO ][o.e.p.PluginsService ] [v2-es-ap] no plugins loaded
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] initialized
[2017-03-13T12:52:13,990][INFO ][o.e.n.Node ] [v2-es-ap] starting ...
[2017-03-13T12:52:14,099][WARN ][i.n.u.i.MacAddressUtil ] Failed to find a usable hardware address from the network interfaces; using random bytes: 54:bc:a3:8e:56:f6:8a:87
[2017-03-13T12:52:14,158][INFO ][o.e.t.TransportService ] [v2-es-ap] publish_address {127.0.0.1:9303}, bound_addresses {10.10.0.27:9303}, {[::1]:9303}, {127.0.0.1:9303}, {[fe80::446:7aff:fe86:c29]:9303}
[2017-03-13T12:52:14,162][INFO ][o.e.b.BootstrapChecks ] [v2-es-ap] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-03-13T12:52:16,200][INFO ][o.e.c.s.ClusterService ] [v2-es-ap] detected_master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301}, added {{v2-es-us}{iimGscvFTyiGA8m50QPrFQ}{tntZTWWNQyaY_v7QI1e18w}{localhost}{127.0.0.1:9302},{v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301},}, reason: zen-disco-receive(from master [master {v2-es-eu}{6zRFMNTPTrSCsOia1hy_oA}{uEtCyd2vQQa8nonzXV20Kw}{localhost}{127.0.0.1:9301} committed version [211]])
[2017-03-13T12:52:16,601][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][4]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][2]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,606][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][1]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,607][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][3]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,608][WARN ][o.e.i.c.IndicesClusterStateService] [v2-es-ap] [[declarations][0]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-03-13T12:52:16,668][INFO ][o.e.h.HttpServer ] [v2-es-ap] publish_address {127.0.0.1:9200}, bound_addresses {10.10.0.27:9200}, {[::1]:9200}, {127.0.0.1:9200}, {[fe80::446:7aff:fe86:c29]:9200}
[2017-03-13T12:52:16,668][INFO ][o.e.n.Node ] [v2-es-ap] started
ES AP node service status:
- elasticsearch.service - Elasticsearch
Active: failed (Result: signal) since Mon 2017-03-13 13:51:25 UTC; 16h ago
Process: 21175 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 21181 (code=killed, signal=KILL)
Mar 13 12:52:07 v2-es5-ap systemd[1]: Starting Elasticsearch...
Mar 13 12:52:07 v2-es5-ap systemd[1]: Started Elasticsearch.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Main process exited, code=killed, status=9/KILL
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Unit entered failed state.
Mar 13 13:51:25 v2-es5-ap systemd[1]: elasticsearch.service: Failed with result 'signal'.
ES US node service status:
Active: failed (Result: signal) since Mon 2017-03-13 22:05:26 UTC; 9h ago
Process: 6803 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=killed, signal=KILL)
Process: 6798 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 6803 (code=killed, signal=KILL)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
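Since the journal rotated away the interesting part, I'm thinking of making it persistent so the next kill is captured (a minimal sketch, assuming default journald config):

    # /etc/systemd/journald.conf
    [Journal]
    Storage=persistent

    # systemctl restart systemd-journald
    # then, after the next kill: journalctl -k --since -2h | grep -i oom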