Elasticsearch GC timeout on data node

Hello,

I tried to find an answer to my problem, but I couldn't find a solution in the existing threads I have read.

I have a cluster of 7 Elasticsearch nodes (2 master-eligible, 4 ingest, all data). The latest node was added two weeks ago. All the nodes run the same Elasticsearch version (7.7).

This node is a data-only node with 8 vCPUs and 64 GB of RAM.
When I look at the logs, it seems like I have a CPU issue:

[2021-07-04T04:46:30,230][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468058][116661] duration [1.5s], collections [1]/[1.9s], total [1.5s]/[1.6d], memory [10.4gb]->[9.9gb]/[16gb], all_pools {[young] [480mb]->[0b]/[0b]}{[old] [9.9gb]->[9.9gb]/[16gb]}{[survivor] [72.9mb]->[20.5mb]/[0b]}
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468058] overhead, spent [1.5s] collecting in the last [1.9s]
[2021-07-04T04:46:43,279][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468064] overhead, spent [485ms] collecting in the last [1s]
[2021-07-04T04:46:59,641][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468080] overhead, spent [376ms] collecting in the last [1.3s]
[2021-07-04T04:47:32,702][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468113] overhead, spent [377ms] collecting in the last [1s]
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468164][116683] duration [2.5s], collections [1]/[3.5s], total [2.5s]/[1.6d], memory [12.3gb]->[10.1gb]/[16gb], all_pools {[young] [2.2gb]->[0b]/[0b]}{[old] [10gb]->[10gb]/[16gb]}{[survivor] [49.5mb]->[83.1mb]/[0b]}
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468164] overhead, spent [2.5s] collecting in the last [3.5s]
[2021-07-04T04:49:53,510][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468248][116700] duration [2.3s], collections [1]/[3.3s], total [2.3s]/[1.6d], memory [10.7gb]->[10.2gb]/[16gb], all_pools {[young] [520mb]->[0b]/[0b]}{[old] [10.1gb]->[10.2gb]/[16gb]}{[survivor] [67.2mb]->[48mb]/[0b]}
[2021-07-04T04:49:53,511][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468248] overhead, spent [2.3s] collecting in the last [3.3s]
[2021-07-04T04:50:04,021][INFO ][o.e.n.Node               ] [ELKNODE4] stopping ...
[2021-07-04T04:50:05,877][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468260] overhead, spent [602ms] collecting in the last [1s]
[2021-07-04T04:50:06,243][INFO ][o.e.x.w.WatcherService   ] [ELKNODE4] stopping watch service, reason [shutdown initiated]
[2021-07-04T04:50:06,431][INFO ][o.e.x.w.WatcherLifeCycleService] [ELKNODE4] watcher has stopped and shutdown
[2021-07-04T04:50:08,723][INFO ][o.e.c.c.Coordinator      ] [ELKNODE4] master node [{MASTERNODE}{uYN4jfWaQW6UZWX22_JMdQ}{iHf-jv4VQOCw8jH8Blq2Gw}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{dilmrt}{ml.machine_memory=67354615808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [MASTERNODE][xx.xx.xx.xx:9300][disconnected] disconnected
[2021-07-04T04:50:09,533][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ELKNODE4] [controller/1929] [Main.cc@150] Ml controller exiting
[2021-07-04T04:50:09,535][INFO ][o.e.x.m.p.NativeController] [ELKNODE4] Native controller process has stopped - no new native processes can be started
[2021-07-04T04:50:13,296][ERROR][i.n.u.c.D.rejectedExecution] [ELKNODE4] Failed to submit a listener notification task. Event loop shut down?
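To get a feel for how bad the situation was, I quantified the GC overhead from these log lines. This is a quick sketch I wrote (not an official tool) that parses the "overhead" lines with a regular expression and sums how much wall-clock time was spent collecting; the sample lines below are copied from the log above:

```python
import re

# Two "overhead" lines copied from the log excerpt above.
LOG = """\
[2021-07-04T04:46:30,230][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468164] overhead, spent [2.5s] collecting in the last [3.5s]
"""

# Matches only seconds-denominated "overhead" lines; durations logged in
# other units (ms, d) would need extra handling.
PATTERN = re.compile(r"spent \[([\d.]+)s\] collecting in the last \[([\d.]+)s\]")

def gc_overhead(log: str):
    """Return (total GC seconds, total window seconds, overhead ratio)."""
    spent = window = 0.0
    for m in PATTERN.finditer(log):
        spent += float(m.group(1))
        window += float(m.group(2))
    return spent, window, spent / window

spent, window, ratio = gc_overhead(LOG)
print(f"GC: {spent:.1f}s of {window:.1f}s ({ratio:.0%})")  # → GC: 4.3s of 5.7s (75%)
```

Spending roughly three quarters of each monitored window in GC pauses is what eventually makes the node unresponsive to the master.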

I tried setting the JVM heap to 16 GB and to 20 GB. When I go higher, the service fails to start with an "Out of Memory" error.
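For reference, this is how the heap is configured (the file path follows the default package layout; adjust it for your install). On a 64 GB host the usual guidance is to keep the heap at or below 50% of RAM, and under roughly 31 GB so the JVM keeps compressed object pointers:

```
# /etc/elasticsearch/jvm.options.d/heap.options (path is an assumption)
# Min and max heap should be set to the same value.
-Xms16g
-Xmx16g
```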

Below is some monitoring data:

I have a hard time figuring out what causes the issue.

Can someone help me, please?

Best regards

You should always aim to have 3 master-eligible nodes in a cluster; having just 2 is bad. Why are only 4 of the nodes ingest nodes? Are all nodes the same specification?

Do you have anything else running on these hosts or are they dedicated to Elasticsearch? What is the full output of the cluster stats API?
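If it helps, the cluster stats can be pulled like this (host and port are assumptions; add credentials if your cluster is secured):

```shell
curl -s "http://localhost:9200/_cluster/stats?human&pretty"
```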

What type of storage are you using for the nodes? Local SSDs?