Recently one of my nodes, which runs with a 2g heap (both Xms and Xmx set to 2g), went crazy doing GC:
[2018-01-21T21:29:05,958][WARN ][o.e.m.j.JvmGcMonitorService] [NVMBD2BFM70V03] [gc][young][4109765][392598] duration [13.5s], collections [1]/[3.9s], total [13.5s]/[50.5m], memory [1.4gb]->[1.5gb]/[1.9gb], all_pools {[young] [456.8mb]->[543.3mb]/[546.1mb]}{[survivor] [9mb]->[9mb]/[68.2mb]}{[old] [1gb]->[1gb]/[1.3gb]}
[2018-01-21T21:29:07,500][WARN ][o.e.m.j.JvmGcMonitorService] [NVMBD2BFM70V03] [gc][4109765] overhead, spent [13.5s] collecting in the last [3.9s]
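For reference, the heap on this node is pinned the usual way in jvm.options; this is just a sketch of the relevant lines, nothing else there deviates from the defaults:

# jvm.options on the affected node (only the heap lines shown)
-Xms2g
-Xmx2g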
The setup I have consists of 9 nodes:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
x 4 97 4 0.85 0.45 0.36 di - x
x 2 66 3 0.54 0.31 0.30 mi - x
x 14 99 9 0.52 0.69 0.88 d - x
x 13 99 4 0.71 0.41 0.35 di - x
x 2 67 3 0.78 0.69 0.55 mi * x
x 8 74 11 0.25 0.86 1.48 d - x
x 1 66 1 0.05 0.07 0.06 - - x
x 5 73 9 1.76 1.23 1.05 di - x
x 2 63 2 0.33 0.34 0.31 mi - x
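(The table above is the output of the cat nodes API; I believe I pulled it with something along these lines, and the default columns match what is shown:)

$ curl -s 'localhost:9200/_cat/nodes?v'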
Now, a single node going down should not have brought the entire cluster down. However, when I dug deeper into the logs I saw that the malfunctioning node was rightly evicted; the master node at the time had the following logs:
$ cat MACHINELEARNING-2018-01-21.log
HERE IS THE LOG (NOT ABLE TO FIT IN THE BODY HERE)
One other master-eligible node also has a similar message:
$ cat MACHINELEARNING-2018-01-21.log
[2018-01-21T21:34:01,076][INFO ][o.e.c.s.ClusterService ] [NVMBD2BFL90V01] removed {{NVMBD2BFM70V03}{RKBjq_CxTFOAteqtLgHZwQ}{laMJDPWjQDWO5dKH07i-uw}{10.141.172.110}{10.141.172.110:9300},}, reason: zen-disco-receive(from master [master {NVMBD2BFL90V02}{oljvKtCNSbOHdNC_gJI12g}{VB7Ctc4FSgihC9hY1wrWNA}{10.141.172.106}{10.141.172.106:9300} committed version [259]])
At this point the entire cluster went into a hung state and remained so for the next six hours, until I manually restarted the problematic "crazy GC" node.
What settings am I missing here?
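For what it's worth, the only settings I can think of that might matter here are the zen fault-detection and master-quorum ones. A sketch of what I mean in elasticsearch.yml; the values are illustrative guesses, not taken from my actual config:

# elasticsearch.yml -- illustrative values only, not my running config
discovery.zen.minimum_master_nodes: 2    # quorum of my 3 master-eligible nodes
discovery.zen.fd.ping_timeout: 30s       # how long the master waits for a fault-detection ping reply
discovery.zen.fd.ping_retries: 3         # failed pings before a node is considered gone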