Elasticsearch version: 6.5.4
The ES cluster has 16 ingest nodes, each running with -Xms8g -Xmx8g. Around 120 fluentd instances push data to the cluster.
The ingest nodes keep getting restarted with java.lang.OutOfMemoryError: Java heap space. Log from one of the affected nodes:
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:06:05.757Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244670] overhead, spent [1.9m] collecting in the last [1.9m]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:06:37.985Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][old][1244671][21569] duration [13s], collections [1]/[13s], total [13s]/[1h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.5mb]->[16.5mb]/[16.6mb]}{[old] [7.8gb]->[7.8gb]/[7.8gb]}"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:07:10.053Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244671] overhead, spent [13s] collecting in the last [13s]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:08:14.418Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again"}
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid12.hprof ...
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:08:40.213Z","logger":"o.e.d.z.UnicastZenPing","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"failed to send ping to [{apaas-belk-elasticsearch-master-59bf85b856-6pmb8}{CGnn671kS9i2LGUqn7S38g}{Kl7y_1B7SS68y6JfWDsAaQ}{172.16.146.72}{172.16.146.72:9300}]"}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [apaas-belk-elasticsearch-master-59bf85b856-6pmb8][172.16.146.72:9300][internal:discovery/zen/unicast] request_id [1247437] timed out after [122227ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:11:09.329Z","logger":"o.e.t.TransportService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"Transport response handler not found of id [799190]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:11:22.288Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][old][1244672][21576] duration [1.9m], collections [7]/[1.6m], total [1.9m]/[1h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.5mb]->[16.6mb]/[16.6mb]}{[old] [7.8gb]->[7.8gb]/[7.8gb]}"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:11:22.288Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244672] overhead, spent [1.9m] collecting in the last [1.6m]"}
Heap dump file created [10860628498 bytes in 102.668 secs]
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:18:08.456Z","logger":"o.e.d.z.UnicastZenPing","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"failed to send ping to [{apaas-belk-elasticsearch-master-59bf85b856-6pmb8}{CGnn671kS9i2LGUqn7S38g}{Kl7y_1B7SS68y6JfWDsAaQ}{172.16.146.72}{172.16.146.72:9300}]"}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [apaas-belk-elasticsearch-master-59bf85b856-6pmb8][172.16.146.72:9300][internal:discovery/zen/unicast] request_id [1247439] timed out after [348447ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:20:56.705Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:17:16.939Z","logger":"o.e.d.z.UnicastZenPing","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"failed to send ping to [{apaas-belk-elasticsearch-master-59bf85b856-6pmb8}{CGnn671kS9i2LGUqn7S38g}{Kl7y_1B7SS68y6JfWDsAaQ}{172.16.146.72}{172.16.146.72:9300}]"}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [apaas-belk-elasticsearch-master-59bf85b856-6pmb8][172.16.146.72:9300][internal:discovery/zen/unicast] request_id [1247438] timed out after [484549ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:21:16.729Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][old][1244673][21613] duration [8.9m], collections [37]/[9.1m], total [8.9m]/[1.2h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.6mb]->[16.6mb]/[16.6mb]}{[old] [7.8gb]->[7.8gb]/[7.8gb]}"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:21:16.730Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244673] overhead, spent [8.9m] collecting in the last [9.1m]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:22:47.280Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][old][1244674][21632] duration [4.6m], collections [19]/[5.7m], total [4.6m]/[1.3h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.6mb]->[16.6mb]/[16.6mb]}{[old] [7.8gb]->[7.8gb]/[7.8gb]}"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:23:58.034Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244674] overhead, spent [4.6m] collecting in the last [5.7m]"}
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "elasticsearch[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8][[unicast_connect]][T#39211]"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "elasticsearch[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8][management][T#7]"
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T03:15:26.980Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][old][1244676][21854] duration [52.3m], collections [222]/[52m], total [52.3m]/[2.1h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.6mb]->[16.6mb]/[16.6mb]}{[old] [7.8gb]->[7.8gb]/[7.8gb]}"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T03:16:43.966Z","logger":"o.e.t.TransportService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"Transport response handler not found of id [799215]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T03:18:16.975Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"[gc][1244676] overhead, spent [52.3m] collecting in the last [52m]"}
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T02:56:10.911Z","logger":"o.e.d.z.UnicastZenPing","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"failed to resolve host [apaas-belk-elasticsearch-discovery]"}
java.lang.OutOfMemoryError: Java heap space
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "elasticsearch[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8][[unicast_connect]][T#39209]"
{"type":"log","host":"apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8","level":"WARN","systemid":"e7c84e34d38a49e1ae639a5dab455af5","system":"BELK","time": "2021-02-18T03:19:57.055Z","logger":"o.e.t.TransportService","timezone":"UTC","marker":"[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8] ","log":"Transport response handler not found of id [799255]"}
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "elasticsearch[apaas-belk-elasticsearch-client-75b6f4d58d-ncwr8][[unicast_connect]][T#39212]"
Why do we see "java.lang.OutOfMemoryError: Java heap space" instead of a circuit breaker exception? We expected the circuit breaker to trip before the heap was exhausted, so that the node would not restart.
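
To make the pattern in the log easier to see, here is a minimal sketch of a helper that pulls the heap figures out of the [gc][old] lines from JvmGcMonitorService. It is only an illustration and not part of Elasticsearch: the names parse_gc_line, MEMORY_RE, DURATION_RE and SAMPLE are hypothetical, and the sample line is a shortened copy of one of the entries above (only the "time" and "log" fields are kept). It assumes each log entry is a single JSON object per line, as in the output pasted here.

```python
import json
import re

# Matches the "memory [before]->[after]/[max]" part of a GC monitor message.
MEMORY_RE = re.compile(r"memory \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]")
# Matches the "duration [..]" part of a GC monitor message.
DURATION_RE = re.compile(r"duration \[([^\]]+)\]")


def parse_gc_line(line):
    """Return heap figures for a [gc][old] log line, or None for any other line."""
    try:
        record = json.loads(line)
    except ValueError:
        # Stack traces and plain-text lines such as "Dumping heap to ..." are not JSON.
        return None
    if not isinstance(record, dict):
        return None
    message = record.get("log", "")
    if "[gc][old]" not in message:
        return None
    memory = MEMORY_RE.search(message)
    duration = DURATION_RE.search(message)
    if memory is None or duration is None:
        return None
    before, after, maximum = memory.groups()
    return {
        "time": record.get("time"),
        "duration": duration.group(1),
        "before": before,
        "after": after,
        "max": maximum,
        # True when the reported heap usage is unchanged after the old-gen collection.
        "heap_unchanged": before == after,
    }


# Shortened copy of one log entry from this post (other JSON fields omitted).
SAMPLE = (
    '{"time": "2021-02-18T02:06:37.985Z", '
    '"log": "[gc][old][1244671][21569] duration [13s], collections [1]/[13s], '
    'total [13s]/[1h], memory [7.9gb]->[7.9gb]/[7.9gb]"}'
)

if __name__ == "__main__":
    print(parse_gc_line(SAMPLE))
```

Running this over every [gc][old] line in the log above gives the same picture each time: usage is reported as 7.9gb before and after the collection, against a 7.9gb maximum, so the old-generation collections are not freeing any heap.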