We have 3 AWS instances of type m4.2xlarge running Elasticsearch, with 1 master node and 2 shards. There are 3 indices in total, and one of them holds about 13 million documents.
We have never had this issue before. These instances have been running since July, and the ES services have not been restarted since then.
All of a sudden, one of the instances was terminated because its EC2 health check failed, and the ES cluster was no longer green. Within a few minutes, a new instance was launched and added to the ES cluster, and the cluster turned green again.
The question is: why was the instance terminated? Was it due to a memory issue? I can see 'Unable to lock JVM Memory' on all 3 instances, but only one of them was terminated. Please help us understand the root cause and the resolution.
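In case it helps with the memory-lock part of the question, here is a minimal sketch (not something we have deployed) for checking the memlock limits that the 'Unable to lock JVM Memory' warning refers to. It assumes Linux and that it is run as the same user that runs Elasticsearch. The helper names `format_limit` and `memlock_allows_mlockall` are made up for illustration. It reports the limits of the process running the script, not of the ES JVM itself, so it is only a rough indication.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_allows_mlockall(soft, hard):
    """Return True only when both limits are unlimited.

    This mirrors the 'soft memlock unlimited' / 'hard memlock unlimited'
    settings suggested by the bootstrap warning in the logs.
    """
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


if __name__ == "__main__":
    # Limits of the current process, in bytes (Linux only).
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock soft limit:", format_limit(soft))
    print("memlock hard limit:", format_limit(hard))
    if memlock_allows_mlockall(soft, hard):
        print("memlock is unlimited for this user")
    else:
        print("memlock is limited; the JVM may fail to lock its memory")
```

With the limits shown in our logs (soft 65536, hard 65536), this check would report that memlock is limited. That would explain the warning, but it does not by itself tell us why only one instance was terminated.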
Logs
ES3 instance logs (the instance that got terminated):
[2018-11-12 15:02:58,603][INFO ][discovery.ec2 ] [es3-prod] master_left [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-11-12 15:02:58,606][WARN ][discovery.ec2 ] [es3-prod] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{es1-prod}{pvkYfa9zR8Klivf7Y4x0Bg}{XXXX}{XXXX:9300},{es3-prod}{a71vFKSHRaiZUO22Q9yJ6g}{XXXX}{XXXX:9300},}
[2018-11-12 15:02:58,606][INFO ][cluster.service ] [es3-prod] removed {{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300},}, reason: zen-disco-master_failed ({es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300})
[2018-11-12 15:02:58,607][DEBUG][action.admin.cluster.health] [es3-prod] connection exception while trying to forward request to master node [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [es2-prod][XXXX:9300][cluster:monitor/health] disconnected]
[2018-11-12 15:02:58,607][INFO ][rest.suppressed ] /_cluster/health Params: {}
MasterNotDiscoveredException
at org.elasticsearch.action.support.master.TransportMasterNodeAction$6.handleException(TransportMasterNodeAction.java:195)
at org.elasticsearch.transport.TransportService$Adapter$3.run(TransportService.java:588)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[2018-11-12 15:19:45,226][DEBUG][action.admin.indices.exists.indices] [es3-prod] no known master node, scheduling a retry
[2018-11-12 15:19:49,795][INFO ][cluster.service ] [es3-prod] detected_master {es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}, added {{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300},}, reason: zen-disco-receive(from master [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}])
[2018-11-12 15:24:41,337][INFO ][cluster.service ] [es3-prod] removed {{es1-prod}{pvkYfa9zR8Klivf7Y4x0Bg}{XXXX}{XXXX:9300},}, reason: zen-disco-receive(from master [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}])
[2018-11-12 15:24:59,399][INFO ][cluster.service ] [es3-prod] added {{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300},}, reason: zen-disco-receive(from master [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}])
[2018-11-12 15:26:53,044][INFO ][discovery.ec2 ] [es3-prod] master_left [{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300}], reason [shut_down]
[2018-11-12 15:26:53,045][WARN ][discovery.ec2 ] [es3-prod] master left (reason = shut_down), current nodes: {{es3-prod}{a71vFKSHRaiZUO22Q9yJ6g}{XXXX}{XXXX:9300},{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300},}
[2018-11-12 15:26:53,045][INFO ][cluster.service ] [es3-prod] removed {{es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300},}, reason: zen-disco-master_failed ({es2-prod}{3bHdXNIWS-qVFY75qtefNQ}{XXXX}{XXXX:9300})
[2018-11-12 15:26:57,637][INFO ][cluster.service ] [es3-prod] detected_master {es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300}, reason: zen-disco-receive(from master [{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300}])
[2018-11-12 15:27:09,082][INFO ][cluster.service ] [es3-prod] added {{es2-prod}{AynTsvb1TDmrXqF_6MBzUg}{XXXX}{XXXX:9300},}, reason: zen-disco-receive(from master [{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300}])
[2018-11-12 15:30:27,901][WARN ][bootstrap ] Unable to lock JVM Memory: error=12,reason=Cannot allocate memory
[2018-11-12 15:30:27,902][WARN ][bootstrap ] This can result in part of the JVM being swapped out.
[2018-11-12 15:30:27,902][WARN ][bootstrap ] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536
[2018-11-12 15:30:27,902][WARN ][bootstrap ] These can be adjusted by modifying /etc/security/limits.conf, for example:
# allow user 'elasticsearch' mlockall
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
[2018-11-12 15:30:27,902][WARN ][bootstrap ] If you are logged in interactively, you will have to re-login for the new limits to take effect.
[2018-11-12 15:30:28,073][INFO ][node ] [es3-prod] version[2.1.1], pid[3603], build[40e2c53/2015-12-15T13:05:55Z]
[2018-11-12 15:30:28,074][INFO ][node ] [es3-prod] initializing ...
[2018-11-12 15:30:28,383][INFO ][plugins ] [es3-prod] loaded [cloud-aws, delete-by-query], sites [head]
[2018-11-12 15:30:28,400][INFO ][env ] [es3-prod] using [1] data paths, mounts [[/ebs (/dev/xvdb)]], net usable_space [72.4gb], net total_space [98.3gb], spins? [no], types [ext4]
[2018-11-12 15:30:30,380][INFO ][node ] [es3-prod] initialized
[2018-11-12 15:30:30,380][INFO ][node ] [es3-prod] starting ...
[2018-11-12 15:30:30,507][WARN ][common.network ] [es3-prod] publish address: {0.0.0.0} is a wildcard address, falling back to first non-loopback: {XXXX}
[2018-11-12 15:30:30,507][INFO ][transport ] [es3-prod] publish_address {XXXX:9300}, bound_addresses {[::]:9300}
[2018-11-12 15:30:30,518][INFO ][discovery ] [es3-prod] elastic1/hXW8lE1iS02Kd-o-p7Rdvw
[2018-11-12 15:30:35,490][INFO ][cluster.service ] [es3-prod] detected_master {es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300}, added {{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300},{es2-prod}{AynTsvb1TDmrXqF_6MBzUg}{XXXX}{XXXX:9300},{es3-prod}{a71vFKSHRaiZUO22Q9yJ6g}{XXXX}{XXXX:9300},}, reason: zen-disco-receive(from master [{es1-prod}{2T2eQar7RZGwd1u1zbzs9w}{XXXX}{XXXX:9300}])
[2018-11-12 15:30:38,570][WARN ][transport.netty ] [es3-prod] exception caught on transport layer [[id: 0xfb743cf6]], closing connection