Cluster Blackout

Recently one of my nodes, which runs with a 2g heap (both Xms and Xmx set to 2g), went crazy doing GC:

[2018-01-21T21:29:05,958][WARN ][o.e.m.j.JvmGcMonitorService] [NVMBD2BFM70V03] [gc][young][4109765][392598] duration [13.5s], collections [1]/[3.9s], total [13.5s]/[50.5m], memory [1.4gb]->[1.5gb]/[1.9gb], all_pools {[young] [456.8mb]->[543.3mb]/[546.1mb]}{[survivor] [9mb]->[9mb]/[68.2mb]}{[old] [1gb]->[1gb]/[1.3gb]}
[2018-01-21T21:29:07,500][WARN ][o.e.m.j.JvmGcMonitorService] [NVMBD2BFM70V03] [gc][4109765] overhead, spent [13.5s] collecting in the last [3.9s]

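For reference, the heap on that node is set with the usual pair of flags; a minimal sketch of the relevant lines in config/jvm.options (the same values could equally be passed via ES_JAVA_OPTS, this is just how it looks in the file):

# min and max heap pinned to the same size, 2g, on the affected node
-Xms2g
-Xmx2g
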
The setup I have consists of 9 nodes:

ip    heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
x            4          97   4    0.85    0.45     0.36 di        -      x
x            2          66   3    0.54    0.31     0.30 mi        -      x
x           14          99   9    0.52    0.69     0.88 d         -      x
x           13          99   4    0.71    0.41     0.35 di        -      x
x            2          67   3    0.78    0.69     0.55 mi        *      x
x            8          74  11    0.25    0.86     1.48 d         -      x
x            1          66   1    0.05    0.07     0.06 -         -      x
x            5          73   9    1.76    1.23     1.05 di        -      x
x            2          63   2    0.33    0.34     0.31 mi        -      x

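The listing above is the default output of the _cat nodes API; assuming Elasticsearch is listening on the standard HTTP port, it comes from something like:

$ curl -s 'http://localhost:9200/_cat/nodes?v'
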
Now, one node going down should not have brought the entire cluster down. However, when I dug deeper into the logs I saw that the malfunctioning node was rightly evicted; the master node at the time had the following logs:

$ cat MACHINELEARNING-2018-01-21.log

HERE IS THE LOG (NOT ABLE TO FIT IN THE BODY HERE)

One other master-eligible node also has a similar message:

$ cat MACHINELEARNING-2018-01-21.log
[2018-01-21T21:34:01,076][INFO ][o.e.c.s.ClusterService   ] [NVMBD2BFL90V01] removed {{NVMBD2BFM70V03}{RKBjq_CxTFOAteqtLgHZwQ}{laMJDPWjQDWO5dKH07i-uw}{10.141.172.110}{10.141.172.110:9300},}, reason: zen-disco-receive(from master [master {NVMBD2BFL90V02}{oljvKtCNSbOHdNC_gJI12g}{VB7Ctc4FSgihC9hY1wrWNA}{10.141.172.106}{10.141.172.106:9300} committed version [259]])

The entire cluster at that moment went into a hung state and remained so for the next six hours, until I did a manual restart of the problematic "crazy GC" node.
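For context, the manual restart was just a restart of the Elasticsearch service on that node, roughly (assuming a systemd-managed install):

$ sudo systemctl restart elasticsearch
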

What settings am I missing here?
