Hello.
I have an Elasticsearch 2.4.1 cluster with 3 nodes.
I am running some performance benchmarks. All nodes have the same configuration: 1 GB of memory, SSD storage, Debian GNU/Linux 8.
java -version
openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-1~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
I have hit this problem several times. After a stress test (11m, ~585000 documents inserted), one of the nodes (e-test03) reaches its memory limit, with the following messages in the logs:
[2016-10-03 14:39:19,801][WARN ][index.engine ] [e-test03] [stress][0] failed engine [indices:data/write/bulk[s] failed on replica]
MapperParsingException[failed to parse]; nested: OutOfMemoryError[Java heap space];
...
[2016-10-03 14:39:19,801][WARN ][indices.cluster ] [e-test03] [stress][0] engine failed, but can't find index shard. failure reason: [indices:data/write/bulk[s] failed on replica]
MapperParsingException[failed to parse]; nested: OutOfMemoryError[Java heap space];
...
[2016-10-03 14:39:20,305][WARN ][cluster.service ] [e-test03] cluster state update task [zen-disco-receive(from master [{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true}])] took 51.6s above the warn threshold of 30s
[2016-10-03 14:39:20,864][INFO ][discovery.zen ] [e-test03] master_left [{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2016-10-03 14:39:20,865][WARN ][discovery.zen ] [e-test03] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{e-test03}{7ueoRMiMRQK73DdmzZTHOg}{192.168.4.86}{192.168.4.86:9300}{max_local_storage_nodes=1, master=true},{e-test02}{aYUn8SDZRPeAJJTggegF6A}{192.168.4.85}{192.168.4.85:9300}{max_local_storage_nodes=1, master=true},}
[2016-10-03 14:39:20,865][INFO ][cluster.service ] [e-test03] removed {{e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true},}, reason: zen-disco-master_failed ({e-test01}{59oXa0saRVuUANX2G7nu_w}{192.168.4.84}{192.168.4.84:9300}{max_local_storage_nodes=1, master=true})
[2016-10-03 14:39:24,747][WARN ][discovery.zen.ping.unicast] [e-test03] failed to send ping to [{#zen_unicast_2#}{192.168.4.85}{192.168.4.85:9300}]
At the same moment the network is fine: I can connect to both ports 9200 and 9300, and send curl requests to any node in the cluster from any other node.
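For reference, the connectivity checks I ran were along these lines (IPs taken from the log excerpts above):

```shell
# HTTP port (9200): every node answers the cluster health API
curl -s 'http://192.168.4.84:9200/_cluster/health?pretty'
curl -s 'http://192.168.4.85:9200/_cluster/health?pretty'
curl -s 'http://192.168.4.86:9200/_cluster/health?pretty'

# Transport port (9300): TCP connect succeeds from each node
nc -zv 192.168.4.84 9300
nc -zv 192.168.4.85 9300
nc -zv 192.168.4.86 9300
```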
The most interesting part is that I cannot restart the node after the OutOfMemoryError. After service elasticsearch restart,
the service stays in status Active: deactivating (stop-sigterm) since Mon 2016-10-03 15:13:13 EEST; 49s ago
(and keeps writing ping errors to the log).
The only thing that works in this case is pkill -9 _es_java_pid_.
The heap size is set to 512m out of the 1024m available, and swap is turned off.
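For completeness, this is how the heap and swap are configured on each node (file path as used by the Debian package; the exact location may differ on other installs):

```shell
# /etc/default/elasticsearch (Debian package defaults file)
# Fixed heap: 512m out of the 1024m of RAM on each node
ES_HEAP_SIZE=512m

# Swap is disabled on all nodes
sudo swapoff -a
```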
I don't mind that the memory limit was reached; the problem is that the cluster cannot return to normal operation afterwards. I also think another problem I hit earlier (try to recover [stress][1] from primary shard with sync id but number of docs differ: 2000394 (e-test03, primary) vs 1999718 (e-test02)) is related to this.
Why can Elasticsearch not ping the other nodes? Is it because of Java GC pauses? How can I prevent this situation in production?
UPDATE: An additional question: can an ES node exit cleanly once an OOM has occurred?
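In case it matters, I was considering something like the following to force a clean exit instead of the half-dead state described above. The -XX:+ExitOnOutOfMemoryError flag exists since OpenJDK 8u92, so my 8u102 should support it, but I have not verified how ES 2.4.1 behaves with it:

```shell
# /etc/default/elasticsearch — terminate the JVM immediately on the first
# OutOfMemoryError, so systemd can restart the service cleanly
ES_JAVA_OPTS="-XX:+ExitOnOutOfMemoryError"
```

An alternative on older JDKs would be -XX:OnOutOfMemoryError="kill -9 %p". Would either of these be the right approach here?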