For the first time since upgrading to version 5.6.8.2 (about three months ago), we observed a strange and fatal error in our production Elasticsearch cluster.
What happened:
All dedicated data nodes (8 in total) died one after another, and most of them complained about JVM heap space. We don't know what could have caused it; everything had been working perfectly fine on the cluster for a very long time.
Elasticsearch Master Logs:
[2019-03-14T17:04:14,174][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [esmaster1] failed to execute on node [WUS7OWKBQNCRBwD7MhA1yA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [analyticsengine-esdata-b7-cluster.iad1.qprod.net][10.20.15.30:9300][cluster:monitor/nodes/stats[n]] request_id [121039465] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-03-14T17:04:14,174][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [esmaster1] failed to execute on node [0Omy4yGZQ6iyWd72R0rkjQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [esdata8][10.24.96.143:9300][cluster:monitor/nodes/stats[n]] request_id [121039467] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-03-14T17:04:24,201][WARN ][o.e.d.z.PublishClusterStateAction] [esmaster1] timed out waiting for all nodes to process published state [752608] (timeout [30s], pending nodes: [{esdata8}{0Omy4yGZQ6iyWd72R0rkjQ}{ZbNiggbiSeOH1j9YFGhBGQ}{10.24.96.143}{10.24.96.143:9300}])
Elasticsearch Data Node 7 logs:
[2019-03-14T16:54:51,508][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8679918] overhead, spent [384ms] collecting in the last [1s]
[2019-03-14T16:59:41,513][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680207] overhead, spent [685ms] collecting in the last [1.6s]
[2019-03-14T16:59:43,204][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680208][909685] duration [749ms], collections [1]/[1.6s], total [749ms]/[11.2h], memory [17.8gb]->[18gb]/[29.8gb], all_pools {[young] [44mb]->[17.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [17.6gb]->[17.8gb]/[28.1gb]}
[2019-03-14T16:59:43,204][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680208] overhead, spent [749ms] collecting in the last [1.6s]
[2019-03-14T16:59:46,654][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680211][909686] duration [1.4s], collections [1]/[1.4s], total [1.4s]/[11.2h], memory [19.5gb]->[18.5gb]/[29.8gb], all_pools {[young] [1.4gb]->[5.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [17.8gb]->[18.3gb]/[28.1gb]}
[2019-03-14T16:59:46,654][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680211] overhead, spent [1.4s] collecting in the last [1.4s]
[2019-03-14T16:59:52,655][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680217] overhead, spent [280ms] collecting in the last [1s]
[2019-03-14T17:04:51,687][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680450][909737] duration [6.5s], collections [1]/[57.9s], total [6.5s]/[11.3h], memory [20.6gb]->[20gb]/[29.8gb], all_pools {[young] [1.3gb]->[3.6mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [19.1gb]->[19.8gb]/[28.1gb]}
[2019-03-14T17:06:09,016][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680459] overhead, spent [918ms] collecting in the last [1.3s]
[2019-03-14T17:07:17,808][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680460][965] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[14.9m], memory [27.9gb]->[28.4gb]/[29.8gb], all_pools {[young] [3.5mb]->[285mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.7gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:07:17,808][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680460] overhead, spent [1.1m] collecting in the last [1.1m]
[2019-03-14T17:08:30,897][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680463][966] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[16m], memory [29.7gb]->[29.3gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.2gb]/[1.4gb]}{[survivor] [156.1mb]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:08:30,897][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680463] overhead, spent [1.1m] collecting in the last [1.1m]
[2019-03-14T17:08:30,903][WARN ][o.e.c.s.ClusterService ] [esdata7] cluster state update task [zen-disco-receive(from master [master {esdata6}{JGboHUM9Sbaw7dZc8Bq56w}{WTfdxo5pTE-inTn0w541Nw}{10.24.16.99}{10.24.16.99:9300} committed version [752613]])] took [1.2m] above the warn threshold of 30s
[2019-03-14T17:09:27,122][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680464][967] duration [55.9s], collections [1]/[56.2s], total [55.9s]/[17m], memory [29.3gb]->[29.4gb]/[29.8gb], all_pools {[young] [1.2gb]->[1.3gb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:09:27,122][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680464] overhead, spent [55.9s] collecting in the last [56.2s]
[2019-03-14T17:10:38,320][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680465][968] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[18.2m], memory
As you can see, heap usage climbed from roughly 18 GB to nearly the full 30 GB heap within minutes, and the same thing happened on all the data nodes. Is there any known bug in Elasticsearch that could lead to this situation?
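For reference, this is roughly how we track per-node heap usage while the problem is happening, using the standard _nodes/stats/jvm API. A minimal sketch only, assuming the cluster is reachable over plain HTTP on localhost:9200 without authentication; adjust ES_URL for your environment:

```python
#!/usr/bin/env python3
"""Poll per-node JVM heap usage via the _nodes/stats/jvm API."""
import json
import time
import urllib.request

# Assumption: cluster address; change to point at your master or coordinating node.
ES_URL = "http://localhost:9200"


def heap_snapshot():
    """Yield (node_name, heap_used_bytes, heap_max_bytes) for every node."""
    with urllib.request.urlopen(f"{ES_URL}/_nodes/stats/jvm", timeout=10) as resp:
        stats = json.load(resp)
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        yield node["name"], mem["heap_used_in_bytes"], mem["heap_max_in_bytes"]


if __name__ == "__main__":
    # Print a heap snapshot every 30 seconds so the growth can be correlated with the GC logs.
    while True:
        for name, used, maximum in heap_snapshot():
            pct = 100.0 * used / maximum
            print(f"{time.strftime('%H:%M:%S')} {name}: "
                  f"{used / 2**30:.1f}gb / {maximum / 2**30:.1f}gb ({pct:.0f}%)")
        time.sleep(30)
```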
See below for an additional stack trace from the data node.