Failed to send ping to node after Out Of Memory error

Hello.

I have an Elasticsearch 2.4.1 cluster with 3 nodes.
I am running some performance benchmarks. All nodes have the same configuration: 1 GB of memory and an SSD, on Debian GNU/Linux 8.

java -version
openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-1~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)

I have hit this problem several times. After a stress test (11 minutes, ~585,000 documents inserted), one of the nodes (e-test03) reached its memory limit with these messages in the logs:

[2016-10-03 14:39:19,801][WARN ][index.engine             ] [e-test03] [stress][0] failed engine [indices:data/write/bulk[s] failed on replica]
MapperParsingException[failed to parse]; nested: OutOfMemoryError[Java heap space];
...
[2016-10-03 14:39:19,801][WARN ][indices.cluster          ] [e-test03] [stress][0] engine failed, but can't find index shard. failure reason: [indices:data/write/bulk[s] failed on replica]
MapperParsingException[failed to parse]; nested: OutOfMemoryError[Java heap space];
...
[2016-10-03 14:39:20,305][WARN ][cluster.service          ] [e-test03] cluster state update task [zen-disco-receive(from master [{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true}])] took 51.6s above the warn threshold of 30s
[2016-10-03 14:39:20,864][INFO ][discovery.zen            ] [e-test03] master_left [{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-10-03 14:39:20,865][WARN ][discovery.zen            ] [e-test03] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{e-test03}{7ueoRMiMRQK73DdmzZTHOg}{192.168.4.86}{192.168.4.86:9300}{max_local_storage_nodes=1, master=true},{e-test02}{aYUn8SDZRPeAJJTggegF6A}{192.168.4.85}{192.168.4.85:9300}{max_local_storage_nodes=1, master=true},}
[2016-10-03 14:39:20,865][INFO ][cluster.service          ] [e-test03] removed {{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true},}, reason: zen-disco-master_failed ({e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true})
[2016-10-03 14:39:24,747][WARN ][discovery.zen.ping.unicast] [e-test03] failed to send ping to [{#zen_unicast_2#}{192.168.4.85}{192.168.4.85:9300}]

At the same time the network is fine: I can connect to both ports 9200 and 9300, and I can send curl requests to any node in the cluster from any other node.
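For example, this is roughly how I check it from each host (illustrative commands; the node IPs are the ones from the logs above):

curl -s 'http://192.168.4.84:9200/_cluster/health?pretty'
curl -s 'http://192.168.4.85:9200/_cluster/health?pretty'
curl -s 'http://192.168.4.86:9200/_cluster/health?pretty'
# transport port reachability, TCP level only
nc -z -v 192.168.4.85 9300
nc -z -v 192.168.4.86 9300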

The most interesting part is that now I cannot restart the node after the Out Of Memory error. After service elasticsearch restart, the service stays in the status Active: deactivating (stop-sigterm) since Mon 2016-10-03 15:13:13 EEST; 49s ago (and keeps writing ping errors to the log).

The only way that works in this case is pkill -9 _es_java_pid_
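A sketch of what I actually end up doing (the main class name and the elasticsearch.service unit name are assumptions based on the stock Debian package):

# find the ES java pid and kill it hard
pgrep -f org.elasticsearch.bootstrap.Elasticsearch
pkill -9 -f org.elasticsearch.bootstrap.Elasticsearch
# presumably equivalent via systemd:
systemctl kill -s SIGKILL elasticsearch.service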

The heap size is set to 512m of the 1024 MB available, and swap is turned off.
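For reference, the relevant settings look roughly like this (a sketch, not verbatim from my hosts; bootstrap.mlockall being enabled at this point is an assumption, see the change further down):

# /etc/default/elasticsearch
ES_HEAP_SIZE=512m

# /etc/elasticsearch/elasticsearch.yml
bootstrap.mlockall: true

# swap disabled on each host
swapoff -a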

Reaching the memory limit is not the problem in itself; the problem is that the cluster cannot return to normal operation afterwards. And I think another problem I hit earlier (try to recover [stress][1] from primary shard with sync id but number of docs differ: 2000394 (e-test03, primary) vs 1999718 (e-test02)) is related to this.

Why can't Elasticsearch ping the other nodes? Is it because of Java GC problems? How can I prevent this situation in production?

UP: An additional question. Can an ES node simply exit cleanly once an OOM has occurred?
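For example, would JVM options like these (passed via ES_JAVA_OPTS in /etc/default/elasticsearch; a sketch, I have not tried them on this cluster yet) make the node die quickly instead of hanging around half-alive?

# /etc/default/elasticsearch
# JDK 8u92+: terminate the JVM on the first OutOfMemoryError
ES_JAVA_OPTS="-XX:+ExitOnOutOfMemoryError"
# alternative for older JVMs: run an external command on OOM
# ES_JAVA_OPTS="-XX:OnOutOfMemoryError=kill -9 %p"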

I hope I have found a solution for the problem, or at least a temporary fix.

First, I removed the ES_HEAP_SIZE limit from /etc/default/elasticsearch.
Second, I set bootstrap.mlockall to false in the node configs. After this, the throughput of my cluster increased by ~8.9% (compared to earlier successful benchmarks; the rate is now around 1500 documents/sec), and all nodes were still alive after a 45-minute test.
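Roughly, the changes were the following (a sketch; the config paths are those of the stock Debian package):

# /etc/default/elasticsearch
# ES_HEAP_SIZE=512m        <- removed / commented out

# /etc/elasticsearch/elasticsearch.yml
bootstrap.mlockall: false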

We'd suggest not running with less than a 1 GB heap, and 2 GB for production-level traffic.

@warkolm, OK, 1 GB is not much memory for production, but that is not the real trouble.
What scares me is ES behavior after an OOM. If I hit this problem on the test cluster, the same could happen with critical data in production. Even worse, the ES cluster can look alive to Zabbix or other monitoring tools while in reality it is dead without symptoms.
And even if some trigger does catch the problem, I don't like the idea of running kill -9 every time a node needs to be restarted automatically.