Elasticsearch JvmGcMonitorService overhead

Hi, we are running ELK 5.4.3 on a single node with one daily index of time-series data (documents are never updated). Each daily index holds around 25 GB of data and is stored in 2 shards.
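For reference, the index sizes and shard layout can be double-checked with the cat APIs; a quick sketch, assuming Elasticsearch on the default localhost:9200:

 curl 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'
 curl 'localhost:9200/_cat/shards?v'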

A few times a day I see warnings like these in the Elasticsearch log:

[2019-01-07T09:35:25,984][WARN ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][young][7635766][3393290] duration [2.2s], collections [1]/[2.6s], total [2.2s]/[1.9d], memory [8.4gb]->[8.4gb]/[12.3gb], all_pools {[young] [2.1mb]->[1mb]/[865.3mb]}{[survivor] [99.5mb]->[101.9mb]/[108.1mb]}{[old] [8.3gb]->[8.3gb]/[11.4gb]}
[2019-01-07T09:35:25,984][WARN ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][7635766] overhead, spent [2.2s] collecting in the last [2.6s]
[2019-01-07T09:38:29,660][WARN ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][old][7635767][784] duration [3m], collections [1]/[3m], total [3m]/[1.1h], memory [8.4gb]->[2.5gb]/[12.3gb], all_pools {[young] [1mb]->[9.9mb]/[865.3mb]}{[survivor] [101.9mb]->[0b]/[108.1mb]}{[old] [8.3gb]->[2.5gb]/[11.4gb]}
[2019-01-07T09:38:29,660][WARN ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][7635767] overhead, spent [3m] collecting in the last [3m]
[2019-01-07T09:41:36,064][INFO ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][young][7635951][3393561] duration [1.4s], collections [2]/[2.2s], total [1.4s]/[1.9d], memory [8.8gb]->[8.4gb]/[12.3gb], all_pools {[young] [398mb]->[4.8mb]/[865.3mb]}{[survivor] [108.1mb]->[64.2mb]/[108.1mb]}{[old] [8.3gb]->[8.4gb]/[11.4gb]}
[2019-01-07T09:41:36,064][WARN ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][7635951] overhead, spent [1.4s] collecting in the last [2.2s]
[2019-01-07T10:09:12,518][INFO ][o.e.c.m.MetaDataMappingService] [Bpmy3wE] [exceeded-2019.01.07/Op9v4qCiSy2CakCtRlRApA] update_mapping [sbc_event]
[2019-01-07T10:52:15,846][INFO ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][7640173] overhead, spent [439ms] collecting in the last [1.2s]
[2019-01-07T12:13:15,629][INFO ][o.e.c.m.MetaDataDeleteIndexService] [Bpmy3wE] [logstash-2019.01.03/pwdQEa96Rai9bXSaxWY5aA] deleting index
[2019-01-07T12:13:32,873][INFO ][o.e.c.m.MetaDataDeleteIndexService] [Bpmy3wE] [exceeded-2019.01.04/aE68UAMcR_e8vf1RQHA1sg] deleting index
[2019-01-07T12:13:34,780][INFO ][o.e.c.m.MetaDataDeleteIndexService] [Bpmy3wE] [collectd-2019.01.04/AfQVhKohRGmRZnKdnOuU3Q] deleting index
[2019-01-07T15:01:55,541][INFO ][o.e.m.j.JvmGcMonitorService] [Bpmy3wE] [gc][7654993] overhead, spent [252ms] collecting in the last [1s]
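To see how close the heap sits to its limit between these collections, the cat nodes API can be polled; a sketch, again assuming the default localhost:9200:

 curl 'localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.percent'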

Elasticsearch (and Kibana) is very slow to work with. There is no problem with disk space, and the heap size is set to:

 -Xmx13g
 -Xms13g
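For JDK 8 (the stack traces further down show 1.8.0_191), GC logging can be enabled alongside these heap settings in jvm.options to see what the collector is actually doing; a sketch, with the log path just an example:

 -XX:+PrintGCDetails
 -XX:+PrintGCDateStamps
 -XX:+PrintGCApplicationStoppedTime
 -Xloggc:/var/log/elasticsearch/gc.log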

Any idea what else to check? Maybe segment merging? Is Elasticsearch unable to keep up with the amount of incoming data? The merge settings are at their defaults.
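For reference, segment counts per index can be inspected with the cat API, and older read-only daily indices can be merged down to a single segment; a sketch (the index name is just an example following the daily pattern):

 curl 'localhost:9200/_cat/segments/logstash-2019.01.06?v'
 curl -XPOST 'localhost:9200/logstash-2019.01.06/_forcemerge?max_num_segments=1'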

Sometimes the node also dies:

[2019-01-08T11:40:06,424][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [9tKeRnI] failed to execute on node [9tKeRnIXTZKr4kNLIyol2A]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [9tKeRnI][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [502158] timed out after [35292ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:934) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.3.jar:5.4.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-01-08T11:40:22,619][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [9tKeRnI] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [9tKeRnIXTZKr4kNLIyol2A]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:933) [elasticsearch-5.4.3.jar:5.4.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.3.jar:5.4.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [9tKeRnI][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [502158] timed out after [35292ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:934) ~[elasticsearch-5.4.3.jar:5.4.3]
    ... 4 more
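When one of these nodes/stats timeouts appears, it may be worth checking at that moment whether the bulk thread pool is also backing up or rejecting requests; a rough check:

 curl 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected'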

It looks like your GC is very slow even though you have a quite small heap. Are you by any chance running this on a VM with memory ballooning, resulting in a lot of swapping?

Yes, it's a VM, but swapping is disabled.
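For completeness, swap and memory locking can be double-checked from inside the guest; a sketch (mlockall only reports true when bootstrap.memory_lock is enabled in elasticsearch.yml):

 free -m       # the Swap line should show 0 total
 swapon -s     # prints nothing when no swap is active
 curl 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'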

Have you checked whether the VM itself is overcommitted with respect to memory? Where are you running this VM?

I don't think so:

vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50

cat /proc/sys/vm/overcommit_memory also returns zero.
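These sysctls only cover overcommit inside the guest, though; whether the KVM host overcommits the VM's memory has to be checked on the hypervisor side. A rough sketch (the domain name elk-vm is just a placeholder):

 lsmod | grep virtio_balloon   # in the guest: balloon driver loaded means the host can reclaim memory
 virsh dommemstat elk-vm       # on the KVM host: compare the actual and rss figures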

Any other ideas what to check? The hot threads command usually returns nothing, or at most something like this:

 1.8% (9.1ms out of 500ms) cpu usage by thread 'elasticsearch[9tKeRnI][bulk][T#4]'
 10/10 snapshots sharing following 39 elements
   sun.nio.ch.FileDispatcherImpl.force0(Native Method)
   sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
   sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388)
   org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:131)
   org.elasticsearch.index.translog.TranslogWriter.writeCheckpoint(TranslogWriter.java:312)
   org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:273)
   org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:520)
   org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:545)
   org.elasticsearch.index.shard.IndexShard$1.write(IndexShard.java:1684)
   org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:107)
    org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcess(AsyncIOProcessor.java:99)
   org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:82)
   org.elasticsearch.index.shard.IndexShard.sync(IndexShard.java:1706)
   org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:307)
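That stack is the translog fsync path (Checkpoint.write -> FileChannelImpl.force), i.e. the bulk thread waiting for the data disk to acknowledge a sync. The disk's write latency can be watched while indexing; a sketch using iostat:

 iostat -x 5    # watch w_await and %util for the device that holds path.data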

Where are you running this VM?

It's a Linux VM with 29 GB of memory, running on KVM.

And you have verified that KVM is not configured to overcommit memory? If so, I am not sure what could cause GC to take so long.

Yes, KVM memory is not the problem, but it seems that the disk is too slow for writes.
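If the per-request translog fsync is what the disk cannot keep up with, one setting that can be tested is async translog durability on the write-heavy daily index; a sketch (the index name is just an example, and async durability can lose up to index.translog.sync_interval, 5s by default, of acknowledged writes if the node crashes):

 curl -XPUT 'localhost:9200/logstash-2019.01.08/_settings' -H 'Content-Type: application/json' -d '
 {
   "index.translog.durability": "async"
 }'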
