We've been running a three-node ELK cluster on Elasticsearch 5.5.2 since last fall, with no trouble until last week. Then on Friday evening two nodes dumped their heap, which filled the root volume and locked up the nodes completely. I resized the VMs' root volumes and started everything back up, and it's been bumpy since.
All three nodes page out periodically with high load averages in short bursts, sometimes as high as 350+, but never for more than five minutes. I haven't had good luck spotting the problem while it's happening, but today the whole cluster locked up despite the loads being reasonable.
If I run "top -p PID" with the PID of the main ES process and then hit "H", I can see a slew of child Java threads running, with one camped at the top with a long run time.
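For reference, one way to find out what that top thread is doing is to match its thread ID from top against a JVM thread dump. This is only a sketch. It assumes a JDK with jstack is installed on the node, that ES runs as a user named "elasticsearch", and that the ES process ID is in a variable called ES_PID; tid_to_nid is a made-up helper name.

```shell
#!/bin/sh
# top -H lists each thread's ID in decimal in the PID column, while
# jstack prints the same ID in hex as "nid=0x...". This hypothetical
# helper converts one form to the other.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage sketch (ES_PID, TID and the "elasticsearch" user are assumptions):
#   top -H -b -n 1 -p "$ES_PID" | head -n 20     # note the busiest thread ID
#   NID=$(tid_to_nid "$TID")
#   sudo -u elasticsearch jstack "$ES_PID" | grep -A 20 "nid=$NID "
```

If the node still answers HTTP requests, the _nodes/hot_threads API gives a similar view without needing shell access.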
Trying to stop ES via systemctl times out. Eventually had to kill -9 it. Also tried kill -9 on the stuck child thread on one node, but that just killed the main process.
I restarted ES on all three nodes, and it's now just about done reassigning the unassigned shards and getting back to green. Hopefully Redis didn't miss too much during the interruption, but I'm not clear on what happens to events that Redis thinks it has passed to ES if they don't make it in.
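To watch the recovery, the unassigned shards can be counted from the _cat/shards output. This is a rough sketch, assuming ES is listening on localhost:9200 without authentication; count_unassigned is a made-up helper name.

```shell
#!/bin/sh
# Reads "_cat/shards?h=index,shard,prirep,state" output on stdin and
# prints how many rows have UNASSIGNED in the fourth (state) column.
count_unassigned() {
    awk '$4 == "UNASSIGNED" { n++ } END { print n + 0 }'
}

# Usage sketch (host and port are assumptions):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state' | count_unassigned
#   curl -s 'localhost:9200/_cluster/health?pretty'
```

The _cluster/health response also reports the overall status (green, yellow, or red) and an unassigned_shards count directly.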
I'm aware that this is a pitiful dearth of usable information. The logs are dense with errors, mostly along the lines of "Node not connected", and I don't even have the dump files from last week. But can anyone point me toward first troubleshooting steps for when this happens again?
Hope to hear from you.