Hi Everyone,
We are hitting a JVM memory issue on 3 of our warm nodes: the JVM heap on these nodes keeps increasing until it reaches 99%, and then the error messages below are reported in the node's log file. After that, the cluster seems to stop working, and we also see timeouts from Kibana saying that the connection to ES timed out.
The situation is not resolved until we force-restart the ES service on that node.
We are currently running ES 6.4.2 in Docker containers with about 31 GB of JVM heap per node, and as I understand it we cannot give a node more JVM heap than that.
Can anybody help with this?
[2019-01-27T05:01:14,534][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:01:14,535][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5497] overhead, spent [26.8s] collecting in the last [27.8s]
[2019-01-27T05:01:45,021][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5498][465] duration [29.4s], collections [1]/[30.4s], total [29.4s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [994.1mb]->[993.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.8gb]/[27.9gb]}
[2019-01-27T05:01:45,020][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:01:45,021][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5498] overhead, spent [29.4s] collecting in the last [30.4s]
[2019-01-27T05:02:14,244][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:02:14,618][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5499][466] duration [28.6s], collections [1]/[29.5s], total [28.6s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [993.9mb]->[995.2mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.8gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:02:14,618][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5499] overhead, spent [28.6s] collecting in the last [29.5s]
[2019-01-27T05:02:44,673][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:02:44,674][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5500][467] duration [29.1s], collections [1]/[30s], total [29.1s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [995.2mb]->[998.7mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:02:44,674][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5500] overhead, spent [29.1s] collecting in the last [30s]
[2019-01-27T05:03:14,461][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:03:14,462][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5501][468] duration [28.8s], collections [1]/[29.7s], total [28.8s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [998.7mb]->[992.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:03:14,462][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5501] overhead, spent [28.8s] collecting in the last [29.7s]
[2019-01-27T05:03:45,099][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:03:45,100][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5502][469] duration [29.7s], collections [1]/[30.6s], total [29.7s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [992.9mb]->[994.7mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:03:45,100][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5502] overhead, spent [29.7s] collecting in the last [30.6s]
[2019-01-27T05:04:13,513][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5503][470] duration [27.4s], collections [1]/[28.4s], total [27.4s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [994.7mb]->[997.2mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:04:13,513][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5503] overhead, spent [27.4s] collecting in the last [28.4s]
[2019-01-27T05:04:13,513][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:04:44,178][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:04:44,179][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5504][471] duration [29.7s], collections [1]/[30.6s], total [29.7s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [997.2mb]->[1002.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:04:44,179][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5504] overhead, spent [29.7s] collecting in the last [30.6s]
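In case it helps with the diagnosis, below is a rough sketch of how we could pull a per-node heap breakdown from the node stats API to see what is holding the old generation (fielddata, segments, caches). This is only an assumption-laden example: the node address 172.18.130.15:9200 and the use of the Python `requests` library are placeholders, and the same numbers can be read directly from GET _nodes/stats.

```python
# Minimal diagnostic sketch (assumptions: the affected node's HTTP port is
# reachable at http://172.18.130.15:9200 and the `requests` library is installed).
# It prints the main heap consumers reported by the node stats API.
import requests

NODE = "http://172.18.130.15:9200"  # assumed HTTP address of the affected node

stats = requests.get(f"{NODE}/_nodes/stats/jvm,indices", timeout=10).json()

for node_id, node in stats["nodes"].items():
    jvm = node["jvm"]["mem"]
    indices = node["indices"]
    print(node["name"])
    print("  heap used:      %d%%" % jvm["heap_used_percent"])
    print("  fielddata:      %d bytes" % indices["fielddata"]["memory_size_in_bytes"])
    print("  segments:       %d bytes" % indices["segments"]["memory_in_bytes"])
    print("  query cache:    %d bytes" % indices["query_cache"]["memory_size_in_bytes"])
    print("  request cache:  %d bytes" % indices["request_cache"]["memory_size_in_bytes"])
```

If one of these numbers (for example segments or fielddata memory) is close to the heap size, that would explain why the old generation stays at 27.9gb even after full GCs.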