Every few days, my cluster gets into a bad state where a few of the nodes are disconnected (seems like the master removes them).
At the same moment they are removed, the affected nodes show this in their logs:
[2018-09-20T08:25:14,454][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-8] [gc][old][208889][624] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[29.1m], memory [23.9gb]->[24gb]/[24.9gb], all_pools {[young] [55mb]->[60.1mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[ol$
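If my understanding is correct, a duration of 1.3m (1.3 minutes) is far too long for a single old-generation collection. Typically these log entries show durations under 1 second, but every few days they climb into the minutes and then the whole cluster comes collapsing down. The same line also shows the heap going from 23.9gb to 24gb out of 24.9gb, so the collection freed essentially nothing.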
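To make the numbers in that line easier to compare across nodes, here is a rough Python sketch that pulls the GC duration and heap figures out of a JvmGcMonitorService entry. The regex and the helper names (`to_seconds`, `to_gb`, `summarize_gc_line`) are my own assumptions based only on the one line pasted above, not an Elasticsearch API, and the sample is cut off before the truncated `all_pools` part.

```python
import re

# The visible part of the log line from above (the truncated tail is left out).
SAMPLE = (
    "[2018-09-20T08:25:14,454][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-8] "
    "[gc][old][208889][624] duration [1.3m], collections [1]/[1.3m], "
    "total [1.3m]/[29.1m], memory [23.9gb]->[24gb]/[24.9gb]"
)

# Assumed layout: [gc][<generation>][<seq>][<count>] duration [<time>] ... memory [a]->[b]/[max]
GC_PATTERN = re.compile(
    r"\[gc\]\[(?P<gen>\w+)\]\[\d+\]\[\d+\] duration \[(?P<duration>[^\]]+)\]"
    r".*?memory \[(?P<before>[^\]]+)\]->\[(?P<after>[^\]]+)\]/\[(?P<max>[^\]]+)\]"
)


def to_seconds(value):
    """Convert a value such as '1.3m' or '850ms' to seconds."""
    match = re.fullmatch(r"([\d.]+)(ms|s|m|h)", value)
    factors = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    return float(match.group(1)) * factors[match.group(2)]


def to_gb(value):
    """Convert a value such as '23.9gb' or '55mb' to gigabytes."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb)", value)
    factors = {"b": 1 / 1024**3, "kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}
    return float(match.group(1)) * factors[match.group(2)]


def summarize_gc_line(line):
    """Return generation, pause length, and heap usage, or None if no match."""
    match = GC_PATTERN.search(line)
    if match is None:
        return None
    before = to_gb(match.group("before"))
    after = to_gb(match.group("after"))
    heap_max = to_gb(match.group("max"))
    return {
        "generation": match.group("gen"),
        "duration_s": to_seconds(match.group("duration")),
        "freed_gb": before - after,  # negative means the heap grew during the GC
        "heap_used_pct": 100.0 * after / heap_max,
    }


if __name__ == "__main__":
    print(summarize_gc_line(SAMPLE))
```

For the line above this reports an old-generation pause of about 78 seconds with the heap still roughly 96% full afterwards, which matches my reading that the collector is running for a long time without reclaiming anything.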