Hello,
I have been looking for an answer to my problem, but I could not find a solution in the existing threads I have read.
I have a cluster of 7 Elasticsearch nodes (2 master, 4 ingest, all data). The latest node was added two weeks ago. All the nodes are on the same Elasticsearch version (7.7).
The new node is a data-only node with 8 vCPUs and 64 GB of RAM.
Looking at the logs, it seems I have a CPU issue:
[2021-07-04T04:46:30,230][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468058][116661] duration [1.5s], collections [1]/[1.9s], total [1.5s]/[1.6d], memory [10.4gb]->[9.9gb]/[16gb], all_pools {[young] [480mb]->[0b]/[0b]}{[old] [9.9gb]->[9.9gb]/[16gb]}{[survivor] [72.9mb]->[20.5mb]/[0b]}
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468058] overhead, spent [1.5s] collecting in the last [1.9s]
[2021-07-04T04:46:43,279][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468064] overhead, spent [485ms] collecting in the last [1s]
[2021-07-04T04:46:59,641][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468080] overhead, spent [376ms] collecting in the last [1.3s]
[2021-07-04T04:47:32,702][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468113] overhead, spent [377ms] collecting in the last [1s]
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468164][116683] duration [2.5s], collections [1]/[3.5s], total [2.5s]/[1.6d], memory [12.3gb]->[10.1gb]/[16gb], all_pools {[young] [2.2gb]->[0b]/[0b]}{[old] [10gb]->[10gb]/[16gb]}{[survivor] [49.5mb]->[83.1mb]/[0b]}
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468164] overhead, spent [2.5s] collecting in the last [3.5s]
[2021-07-04T04:49:53,510][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468248][116700] duration [2.3s], collections [1]/[3.3s], total [2.3s]/[1.6d], memory [10.7gb]->[10.2gb]/[16gb], all_pools {[young] [520mb]->[0b]/[0b]}{[old] [10.1gb]->[10.2gb]/[16gb]}{[survivor] [67.2mb]->[48mb]/[0b]}
[2021-07-04T04:49:53,511][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468248] overhead, spent [2.3s] collecting in the last [3.3s]
[2021-07-04T04:50:04,021][INFO ][o.e.n.Node ] [ELKNODE4] stopping ...
[2021-07-04T04:50:05,877][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468260] overhead, spent [602ms] collecting in the last [1s]
[2021-07-04T04:50:06,243][INFO ][o.e.x.w.WatcherService ] [ELKNODE4] stopping watch service, reason [shutdown initiated]
[2021-07-04T04:50:06,431][INFO ][o.e.x.w.WatcherLifeCycleService] [ELKNODE4] watcher has stopped and shutdown
[2021-07-04T04:50:08,723][INFO ][o.e.c.c.Coordinator ] [ELKNODE4] master node [{MASTERNODE}{uYN4jfWaQW6UZWX22_JMdQ}{iHf-jv4VQOCw8jH8Blq2Gw}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{dilmrt}{ml.machine_memory=67354615808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [MASTERNODE][xx.xx.xx.xx:9300][disconnected] disconnected
[2021-07-04T04:50:09,533][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ELKNODE4] [controller/1929] [Main.cc@150] Ml controller exiting
[2021-07-04T04:50:09,535][INFO ][o.e.x.m.p.NativeController] [ELKNODE4] Native controller process has stopped - no new native processes can be started
[2021-07-04T04:50:13,296][ERROR][i.n.u.c.D.rejectedExecution] [ELKNODE4] Failed to submit a listener notification task. Event loop shut down?
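To put a number on the "overhead" lines above, here is a minimal Python sketch that computes the fraction of each reporting window spent in garbage collection. The function name gc_overhead_ratio and the regular expression are my own (hypothetical, not part of Elasticsearch), and the sketch assumes the durations are only ever printed in "s" or "ms", as they are in this excerpt.

```python
import re

# Matches the "overhead" lines written by JvmGcMonitorService, for example:
#   ... overhead, spent [1.8s] collecting in the last [2.2s]
# Assumption: durations use only the "s" or "ms" suffix, as in the log above.
OVERHEAD_RE = re.compile(
    r"overhead, spent \[([\d.]+)(ms|s)\] collecting in the last \[([\d.]+)(ms|s)\]"
)


def _to_seconds(value, unit):
    """Convert a duration such as ('485', 'ms') or ('1.8', 's') to seconds."""
    seconds = float(value)
    return seconds / 1000.0 if unit == "ms" else seconds


def gc_overhead_ratio(line):
    """Return GC time divided by window length, or None if the line does not match."""
    match = OVERHEAD_RE.search(line)
    if match is None:
        return None
    spent = _to_seconds(match.group(1), match.group(2))
    window = _to_seconds(match.group(3), match.group(4))
    if window == 0:
        return None
    return spent / window


if __name__ == "__main__":
    sample = "[gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]"
    print(f"{gc_overhead_ratio(sample):.0%} of the window spent in GC")
```

For the first WARN line, that is 1.8s out of 2.2s, so roughly 82% of the window went to garbage collection. The INFO lines such as 485ms out of 1s come out at about 49%.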
I tried setting the JVM heap to 16 GB and then to 20 GB. When I go higher, the service fails to start with an "Out of Memory" error.
Below is some monitoring data:
I am having a hard time figuring out what is causing the issue.
Can someone help me, please?
Best regards