I have a 3-node Elasticsearch 5.1.1 cluster that I believe is sized correctly. In the past two weeks it has locked up twice, ultimately reaching the thread pool queue capacity of 1000 and requiring a full cluster restart to bring it back online. Elasticsearch runs inside Docker containers on 3 EC2 instances in different availability zones.
I believe these are the only relevant Elasticsearch logs from when this happened:
[2017-01-04T13:40:44,434][INFO ][o.e.m.j.JvmGcMonitorService] [es1] [gc] overhead, spent [294ms] collecting in the last [1.1s]
[2017-01-04T13:42:29,514][INFO ][o.e.c.m.MetaDataMappingService] [es1] [.monitoring-kibana-2-2017.01.04/0kTB19GeRg686SSUtC8l5g] update_mapping [kibana_stats]
[2017-01-04T13:43:33,268][WARN ][o.e.m.j.JvmGcMonitorService] [es1] [gc] overhead, spent [1.9s] collecting in the last [2.7s]
[2017-01-04T13:43:46,771][INFO ][o.e.n.Node ] [es1] initializing ...
[2017-01-04T13:43:46,841][INFO ][o.e.e.NodeEnvironment ] [es1] using  data paths, mounts [[/usr/share/elasticsearch/data (/dev/xvdf1)]], net usable_space [134.9gb], net total_space [147.5gb], spins? [possibly], types [ext4]
[2017-01-04T13:43:46,842][INFO ][o.e.e.NodeEnvironment ] [es1] heap size [6.9gb], compressed ordinary object pointers [true]
[2017-01-04T13:43:46,911][INFO ][o.e.n.Node ] [es1] node name [es1], node ID [3nX__vT2Q4aA-XjLh9m7Lw]
[2017-01-04T13:43:46,913][INFO ][o.e.n.Node ] [es1] version[5.1.1], pid, build[5395e21/2016-12-06T12:36:15.409Z], OS[Linux/4.4.35-33.55.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b14]
My Zabbix graphs show a CPU spike starting right around 13:40 on all 3 nodes. It looks like that might correspond with the garbage collection in the logs, but I'm not sure. It then appears that the Docker container restarted on its own; I haven't found anything in the system logs to indicate why that might have happened.
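To figure out why the container restarted, I'm planning to ask Docker directly about the last exit and check the kernel for OOM activity. A rough sketch, assuming the container is named es1 (adjust for your actual container name):

```shell
# Did the kernel OOM killer take the container down, and when did it exit?
docker inspect --format '{{.State.OOMKilled}} exit={{.State.ExitCode}} at={{.State.FinishedAt}}' es1

# If a restart policy is configured, Docker will restart the container silently
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' es1

# Kernel-side evidence of OOM kills around the incident
dmesg -T | grep -i -E 'oom|killed process'
```

If OOMKilled comes back true, that would explain the silent restart without anything in the Elasticsearch logs themselves.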
Looking at the Marvel graphs for the first (master) node at 13:40: heap was ~5 GB out of 7 GB, and CPU had been bumping around between 0 and 25% for the previous several hours. Index and search latency were low. Segment count was ~600.
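Next time it starts backing up, I want to see which thread pool is actually filling its queue, before it hits the hard lockup. Something like this against any node (localhost:9200 is a placeholder for a real node address):

```shell
# Per-node thread pool stats: active threads, queued tasks, and rejections
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'

# Pending cluster-level tasks, which pile up when the master is struggling
curl -s 'http://localhost:9200/_cat/pending_tasks?v'
```

A steadily climbing rejected count on one pool (search vs. bulk vs. index) should at least narrow down what kind of load is causing it.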
Some thoughts that I've had but haven't finished exploring:
- Maybe this is caused by a rogue query? If so, any tips on tracking this down?
- The Docker MTU is 1500 while the host's primary network interface MTU is 9001. I've seen older references to this causing connectivity problems between nodes, but it seems those issues may have been worked out in recent versions of Elasticsearch and Docker.
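For the rogue-query theory, my plan is to turn on the search slow log and capture what's running during the next CPU spike, and also to confirm the effective MTU inside the containers. A sketch, where the index name, thresholds, and container name es1 are just examples:

```shell
# Enable the search slow log per index (index name and thresholds are examples)
curl -s -XPUT 'http://localhost:9200/my-index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "1s"
}'

# During a spike: what the hot threads are doing, and which searches are in flight
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
curl -s 'http://localhost:9200/_tasks?actions=*search*&detailed'

# Compare the MTU the container sees vs. the host interface
docker exec es1 cat /sys/class/net/eth0/mtu
cat /sys/class/net/eth0/mtu
```

If the MTUs really do differ and it matters, I understand the Docker daemon accepts an "mtu" setting (daemon.json or the --mtu flag) to make the bridge match the host.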
I'm wondering how to troubleshoot this further, since I'm not seeing anything else in the logs. Any direction would be much appreciated.