We have a cluster with 30 hot nodes (12TB of data, 1051 shards).
This morning some of the hot nodes stopped responding to the master and the cluster froze.
We have seen this before: when a hot node's heap usage climbs too high, the node stalls in long GC pauses and the cluster becomes unresponsive.
Our usual fix is to restart the nodes with high heap usage, and then add more nodes so the problem doesn't recur soon.
This time, however, the heap of the hot nodes jumped to 99% as soon as we restarted them, so we suspect something is wrong with the data (perhaps an issue in a shard that triggers a bug?).
The reason I suspect a bug is that a moment ago I created 3 more nodes and started spreading the shards of one of the problematic nodes across them. Suddenly, the heap of one of the new nodes went to 99% and has stayed there.
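For reference, this is roughly how I'm watching heap per node while shards relocate. It's just a small sketch against the standard `_cat/nodes` API; the host is a placeholder for any node in our cluster:

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder; point at any node in the cluster

def heap_per_node():
    # _cat/nodes exposes heap.percent / heap.current / heap.max per node
    resp = requests.get(
        f"{ES}/_cat/nodes",
        params={"format": "json", "h": "name,heap.percent,heap.current,heap.max"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

while True:
    # Print nodes sorted by heap pressure, worst first
    for node in sorted(heap_per_node(), key=lambda n: -int(n["heap.percent"])):
        print(f'{node["name"]:<25} {node["heap.percent"]:>3}%  '
              f'({node["heap.current"]} / {node["heap.max"]})')
    print("-" * 60)
    time.sleep(30)
```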
Here is the graph showing the jump in heap usage shortly after the new node (a c5.4xlarge on AWS) was created:
Could you advise on how to troubleshoot this? Here is what I see on the nodes that are having issues: constant GC. (This particular node holds 30 shards totalling 385.82 GB of data; the largest shard on it is 21 GB.)
[2019-06-23T01:09:22,155][INFO ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][old][8332][6076] duration [5.3s], collections [1]/[5.3s], total [5.3s]/[5.4h], memory [14.7gb]->[12.9gb]/[15.1gb], all_pools {[young] [635.6mb]->[47.4mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.1gb]->[12.9gb]/[14.1gb]}
[2019-06-23T01:09:22,155][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8332] overhead, spent [5.3s] collecting in the last [5.3s]
[2019-06-23T01:09:30,047][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8336] overhead, spent [4.2s] collecting in the last [4.8s]
[2019-06-23T01:09:33,760][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8337] overhead, spent [3.6s] collecting in the last [3.7s]
[2019-06-23T01:09:38,906][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8338] overhead, spent [4.3s] collecting in the last [5.1s]
[2019-06-23T01:09:41,522][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8339] overhead, spent [2.5s] collecting in the last [2.6s]
[2019-06-23T01:09:46,805][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8340] overhead, spent [4.6s] collecting in the last [5.2s]
[2019-06-23T01:09:51,150][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8341] overhead, spent [3.7s] collecting in the last [4.3s]
[2019-06-23T01:09:55,236][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8342] overhead, spent [3.9s] collecting in the last [4s]
[2019-06-23T01:09:59,595][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8343] overhead, spent [3.7s] collecting in the last [4.3s]
[2019-06-23T01:10:03,745][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8344] overhead, spent [4s] collecting in the last [4.1s]
[2019-06-23T01:10:07,769][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8345] overhead, spent [3.6s] collecting in the last [4s]
[2019-06-23T01:10:11,753][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8346] overhead, spent [3.5s] collecting in the last [3.9s]
[2019-06-23T01:10:15,641][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8347] overhead, spent [3.7s] collecting in the last [3.8s]
[2019-06-23T01:10:19,511][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8348] overhead, spent [3.6s] collecting in the last [3.8s]
[2019-06-23T01:10:24,385][WARN ][o.e.m.j.JvmGcMonitorService] [MYPREFIX-050505] [gc][8349] overhead, spent [4.7s] collecting in the last [4.8s]
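In case it helps, this is the kind of snapshot I'm grabbing from a struggling node while the GC storm is happening. It's a sketch using the standard `_nodes/.../hot_threads` and `_nodes/.../stats/jvm` APIs; `MYPREFIX-050505` is the node name from the logs above and the host is a placeholder:

```python
import requests

ES = "http://localhost:9200"      # placeholder
NODE = "MYPREFIX-050505"          # node name from the GC logs above

# Hot threads: plain-text dump of what the busiest threads are doing
hot = requests.get(f"{ES}/_nodes/{NODE}/hot_threads", timeout=30)
print(hot.text)

# JVM stats: heap usage and old-gen GC counts/times for the same node
stats = requests.get(f"{ES}/_nodes/{NODE}/stats/jvm", timeout=30).json()
for node_id, node in stats["nodes"].items():
    jvm = node["jvm"]
    old = jvm["gc"]["collectors"]["old"]
    print(node["name"],
          "heap:", jvm["mem"]["heap_used_percent"], "%",
          "old GC:", old["collection_count"], "collections,",
          old["collection_time_in_millis"], "ms total")
```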
Right now, even though all nodes are up, the cluster's ingestion rate is less than 1/20 of normal (I see many timeouts when ingesting data), and some API calls such as node stats take a long time to return.
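To understand why ingestion is timing out, I'm also checking for thread-pool rejections and pending cluster tasks. Another small sketch, using the standard `_cat/thread_pool` and `_cluster/pending_tasks` APIs (host is a placeholder):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Thread-pool rejections show where requests are piling up (write/search queues)
pools = requests.get(
    f"{ES}/_cat/thread_pool",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
    timeout=10,
).json()
for p in pools:
    if p["name"] in ("write", "bulk", "search") and p["rejected"] != "0":
        print(p)

# A long pending-tasks queue would explain slow admin calls like node stats
pending = requests.get(f"{ES}/_cluster/pending_tasks", timeout=10).json()
print(len(pending["tasks"]), "pending cluster tasks")
```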
Resolution:
We finally found the issue: it was caused by a client sending a flurry of requests to the cluster, effectively a DoS attack. I had thought ES 6.4.1 would stop accepting requests once its heap couldn't handle them? In any case, we fixed the client and will upgrade to ES 7.2, which has better circuit breakers, hoping that will prevent errors that look like "DoS attacks" in the future.
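For anyone hitting something similar: after the upgrade we plan to keep an eye on the circuit-breaker stats, so that a misbehaving client trips a breaker instead of exhausting the heap (ES 7.x enables the real-memory parent breaker by default). A minimal sketch against the standard `_nodes/stats/breaker` API; the host is a placeholder:

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Per-node circuit-breaker stats; a rising "tripped" count on the parent breaker
# means ES is rejecting requests instead of letting the heap fill up
stats = requests.get(f"{ES}/_nodes/stats/breaker", timeout=10).json()
for node_id, node in stats["nodes"].items():
    for name, b in node["breakers"].items():
        print(f'{node["name"]:<25} {name:<12} '
              f'used {b["estimated_size"]:>10} / limit {b["limit_size"]:>10} '
              f'tripped {b["tripped"]}')
```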