Elasticsearch GC timeout on data node

Hello,

I tried to find an answer to my problem, but I couldn't find a solution in the existing threads I have read.

I have a cluster of 7 Elasticsearch nodes (2 master-eligible, 4 ingest, all data). The latest node was added two weeks ago. All the nodes run the same Elasticsearch version (7.7).

This node is a data-only node with 8 vCPUs and 64 GB of RAM.
When I look at the logs, it seems like I have a CPU issue:

[2021-07-04T04:46:30,230][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468058][116661] duration [1.5s], collections [1]/[1.9s], total [1.5s]/[1.6d], memory [10.4gb]->[9.9gb]/[16gb], all_pools {[young] [480mb]->[0b]/[0b]}{[old] [9.9gb]->[9.9gb]/[16gb]}{[survivor] [72.9mb]->[20.5mb]/[0b]}
[2021-07-04T04:46:37,198][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468058] overhead, spent [1.5s] collecting in the last [1.9s]
[2021-07-04T04:46:43,279][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468064] overhead, spent [485ms] collecting in the last [1s]
[2021-07-04T04:46:59,641][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468080] overhead, spent [376ms] collecting in the last [1.3s]
[2021-07-04T04:47:32,702][INFO ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468113] overhead, spent [377ms] collecting in the last [1s]
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468164][116683] duration [2.5s], collections [1]/[3.5s], total [2.5s]/[1.6d], memory [12.3gb]->[10.1gb]/[16gb], all_pools {[young] [2.2gb]->[0b]/[0b]}{[old] [10gb]->[10gb]/[16gb]}{[survivor] [49.5mb]->[83.1mb]/[0b]}
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468164] overhead, spent [2.5s] collecting in the last [3.5s]
[2021-07-04T04:49:53,510][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][young][468248][116700] duration [2.3s], collections [1]/[3.3s], total [2.3s]/[1.6d], memory [10.7gb]->[10.2gb]/[16gb], all_pools {[young] [520mb]->[0b]/[0b]}{[old] [10.1gb]->[10.2gb]/[16gb]}{[survivor] [67.2mb]->[48mb]/[0b]}
[2021-07-04T04:49:53,511][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468248] overhead, spent [2.3s] collecting in the last [3.3s]
[2021-07-04T04:50:04,021][INFO ][o.e.n.Node               ] [ELKNODE4] stopping ...
[2021-07-04T04:50:05,877][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468260] overhead, spent [602ms] collecting in the last [1s]
[2021-07-04T04:50:06,243][INFO ][o.e.x.w.WatcherService   ] [ELKNODE4] stopping watch service, reason [shutdown initiated]
[2021-07-04T04:50:06,431][INFO ][o.e.x.w.WatcherLifeCycleService] [ELKNODE4] watcher has stopped and shutdown
[2021-07-04T04:50:08,723][INFO ][o.e.c.c.Coordinator      ] [ELKNODE4] master node [{MASTERNODE}{uYN4jfWaQW6UZWX22_JMdQ}{iHf-jv4VQOCw8jH8Blq2Gw}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{dilmrt}{ml.machine_memory=67354615808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [MASTERNODE][xx.xx.xx.xx:9300][disconnected] disconnected
[2021-07-04T04:50:09,533][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ELKNODE4] [controller/1929] [Main.cc@150] Ml controller exiting
[2021-07-04T04:50:09,535][INFO ][o.e.x.m.p.NativeController] [ELKNODE4] Native controller process has stopped - no new native processes can be started
[2021-07-04T04:50:13,296][ERROR][i.n.u.c.D.rejectedExecution] [ELKNODE4] Failed to submit a listener notification task. Event loop shut down?
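To get a feel for how bad the situation was, I quantified the GC overhead from these log lines. This is a quick sketch I wrote (not an official tool) that parses the "overhead" lines with a regular expression and sums how much wall-clock time was spent collecting; the sample lines below are copied from the log above:

```python
import re

# Two "overhead" lines copied from the log excerpt above.
LOG = """\
[2021-07-04T04:46:30,230][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468052] overhead, spent [1.8s] collecting in the last [2.2s]
[2021-07-04T04:48:26,943][WARN ][o.e.m.j.JvmGcMonitorService] [ELKNODE4] [gc][468164] overhead, spent [2.5s] collecting in the last [3.5s]
"""

# Matches only seconds-denominated "overhead" lines; durations logged in
# other units (ms, d) would need extra handling.
PATTERN = re.compile(r"spent \[([\d.]+)s\] collecting in the last \[([\d.]+)s\]")

def gc_overhead(log: str):
    """Return (total GC seconds, total window seconds, overhead ratio)."""
    spent = window = 0.0
    for m in PATTERN.finditer(log):
        spent += float(m.group(1))
        window += float(m.group(2))
    return spent, window, spent / window

spent, window, ratio = gc_overhead(LOG)
print(f"GC: {spent:.1f}s of {window:.1f}s ({ratio:.0%})")  # → GC: 4.3s of 5.7s (75%)
```

Spending roughly three quarters of each monitored window in GC pauses is what eventually makes the node unresponsive to the master.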

I tried setting the JVM heap to 16 GB and to 20 GB. When I go higher, the service fails to start with an "Out of Memory" error.
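For reference, this is how the heap is configured (the file path follows the default package layout; adjust it for your install). On a 64 GB host the usual guidance is to keep the heap at or below 50% of RAM, and under roughly 31 GB so the JVM keeps compressed object pointers:

```
# /etc/elasticsearch/jvm.options.d/heap.options (path is an assumption)
# Min and max heap should be set to the same value.
-Xms16g
-Xmx16g
```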

Below is some monitoring data:

I have a hard time figuring out what causes the issue.

Can someone help me, please?

Best regards

You should always aim to have 3 master-eligible nodes in a cluster; having just 2 is bad. Why are only 4 of the nodes ingest nodes? Are all nodes the same specification?

Do you have anything else running on these hosts or are they dedicated to Elasticsearch? What is the full output of the cluster stats API?
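If it helps, the cluster stats can be pulled like this (host and port are assumptions; add credentials if your cluster is secured):

```shell
curl -s "http://localhost:9200/_cluster/stats?human&pretty"
```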

What type of storage are you using for the nodes? Local SSDs?