Lots of garbage collection logs in Elasticsearch 5.6

Hi, I am seeing a lot of "gc overhead, spent ..." messages in the logs of a single-node Elasticsearch cluster; there is also a log line with total [858ms]/[21.7m]. How do I read these JvmGcMonitorService logs? What does total [858ms]/[21.7m] mean, and what could be the cause of this?

We have about 200 indices, each with 1 shard, and the total heap given to Elasticsearch is 9 GB.

Logs

    INFO   | jvm 1    | 2020/03/02 07:30:39 | [2020-03-02T07:30:39,456][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219234] overhead, spent [664ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:40 | [2020-03-02T07:30:40,472][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219235] overhead, spent [410ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:41 | [2020-03-02T07:30:41,644][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219236] overhead, spent [583ms] collecting in the last [1.1s]
    INFO   | jvm 1    | 2020/03/02 07:30:42 | [2020-03-02T07:30:42,659][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219237] overhead, spent [321ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:43 | [2020-03-02T07:30:43,831][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][219238][10908] duration [858ms], collections [1]/[1.1s], total [858ms]/[21.7m], memory [4.7gb]->[4.4gb]/[7.7gb], all_pools {[young] [355.3mb]->[17.1mb]/[865.3mb]}{[survivor] [10.3mb]->[12.2mb]/[108.1mb]}{[old] [4.4gb]->[4.4gb]/[6.7gb]}

This is happening continuously!

    INFO   | jvm 1    | 2020/03/02 17:10:26 | [2020-03-02T17:10:26,023][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253338] overhead, spent [264ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:23 | [2020-03-02T17:11:23,244][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253392] overhead, spent [488ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:36 | [2020-03-02T17:11:36,448][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253405] overhead, spent [530ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:52 | [2020-03-02T17:11:52,683][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253420][12474] duration [1.2s], collections [1]/[2s], total [1.2s]/[28.4m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [732.8mb]->[2.5mb]/[865.3mb]}{[survivor] [83.9mb]->[89.8mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:11:52 | [2020-03-02T17:11:52,683][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253420] overhead, spent [1.2s] collecting in the last [2s]
    INFO   | jvm 1    | 2020/03/02 17:12:21 | [2020-03-02T17:12:20,996][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253447][12476] duration [1.4s], collections [1]/[1.9s], total [1.4s]/[28.5m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [709.3mb]->[2mb]/[865.3mb]}{[survivor] [69.3mb]->[108.1mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:12:21 | [2020-03-02T17:12:20,996][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253447] overhead, spent [1.4s] collecting in the last [1.9s]
    INFO   | jvm 1    | 2020/03/02 17:12:33 | [2020-03-02T17:12:33,169][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253459] overhead, spent [328ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:13:03 | [2020-03-02T17:13:03,076][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253488] overhead, spent [551ms] collecting in the last [1.5s]
    INFO   | jvm 1    | 2020/03/02 17:19:06 | [2020-03-02T17:19:06,568][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253846] overhead, spent [433ms] collecting in the last [1.2s]
    INFO   | jvm 1    | 2020/03/02 17:19:20 | [2020-03-02T17:19:20,553][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253859][12492] duration [1.7s], collections [1]/[1.8s], total [1.7s]/[28.5m], memory [4.4gb]->[3.6gb]/[7.7gb], all_pools {[young] [860.7mb]->[28mb]/[865.3mb]}{[survivor] [74.3mb]->[75.3mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:19:20 | [2020-03-02T17:19:20,553][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253859] overhead, spent [1.7s] collecting in the last [1.8s]
    INFO   | jvm 1    | 2020/03/02 17:21:41 | [2020-03-02T17:21:41,795][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253997] overhead, spent [311ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:22:41 | [2020-03-02T17:22:41,595][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254056] overhead, spent [600ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:24:11 | [2020-03-02T17:24:11,835][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254145] overhead, spent [352ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:25:33 | [2020-03-02T17:25:33,603][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254224][12503] duration [1.8s], collections [1]/[2.6s], total [1.8s]/[28.6m], memory [4.3gb]->[3.5gb]/[7.7gb], all_pools {[young] [817.8mb]->[6.3mb]/[865.3mb]}{[survivor] [37.9mb]->[25.9mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:25:33 | [2020-03-02T17:25:33,603][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254224] overhead, spent [1.8s] collecting in the last [2.6s]
    INFO   | jvm 1    | 2020/03/02 17:26:57 | [2020-03-02T17:26:57,935][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254304][12508] duration [4s], collections [1]/[4.1s], total [4s]/[28.7m], memory [4.4gb]->[3.5gb]/[7.7gb], all_pools {[young] [861.8mb]->[15.2mb]/[865.3mb]}{[survivor] [22.2mb]->[5.2mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:26:57 | [2020-03-02T17:26:57,935][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254304] overhead, spent [4s] collecting in the last [4.1s]
    INFO   | jvm 1    | 2020/03/02 17:27:13 | [2020-03-02T17:27:13,435][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254319] overhead, spent [646ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:29:27 | [2020-03-02T17:29:27,551][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254450][12523] duration [2.2s], collections [1]/[2.3s], total [2.2s]/[28.7m], memory [4.4gb]->[3.6gb]/[7.7gb], all_pools {[young] [825.5mb]->[25.1mb]/[865.3mb]}{[survivor] [40.7mb]->[86.5mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:29:27 | [2020-03-02T17:29:27,551][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254450] overhead, spent [2.2s] collecting in the last [2.3s]
    INFO   | jvm 1    | 2020/03/02 17:29:36 | [2020-03-02T17:29:36,302][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254458][12524] duration [1.6s], collections [1]/[1.6s], total [1.6s]/[28.8m], memory [4.5gb]->[3.5gb]/[7.7gb], all_pools {[young] [865.3mb]->[9.6mb]/[865.3mb]}{[survivor] [86.5mb]->[7.4mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:29:36 | [2020-03-02T17:29:36

How much data / disk is used for the indices on this node?

We have 211 indices and the total disk used is 920 GB. The garbage collection is happening continuously.

My first guess is that fielddata is hitting its throttle point, but that isn't much disk used for that. I don't recall the value for 5.6, but in later versions indices.breaker.fielddata.limit is 40%, so when I see GCs triggering near 40% of heap, this is what I suspect.

You would have to research how much field data you have/need and what the usage patterns are on this node, i.e. search, update, indexing, etc.
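
For example, something along these lines should show how much fielddata is loaded and how close each circuit breaker is to its limit (a minimal sketch using the standard _cat and node-stats endpoints; adjust host/port for your setup):

    # per-node, per-field fielddata cache usage
    curl -s "http://localhost:9200/_cat/fielddata?v"

    # circuit breaker limits, estimated sizes and trip counts
    curl -s "http://localhost:9200/_nodes/stats/breaker?pretty"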

Thank you for the quick reply. Can you please explain what you mean by "fielddata is hitting its throttle point, but that isn't much disk used for that"? By the way, indices.breaker.fielddata.limit defaults to 60% of the JVM heap in ES 5.6.

Also, what do you mean by GCs triggering near 40% of heap? Can fielddata cause lots of garbage collection?

Elastic "rule of thumb" is that a node can handle about 5Tb of data, you are under 1Tb.

Heap is used for different things; one of them is field data. When field data reaches the "breaker" limit, it triggers GC for that type of data.

Back to finding out which Elasticsearch function is driving the heap usage: search, indexing, something else. What does the heap graph look like in Kibana? Is it going over 95% (or whatever total.limit is set to)?

Also, could you share the output of "_cat/nodes?v&s=name" and tell us which node has the issue?
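
If you don't have Monitoring enabled, the same heap numbers can be pulled directly from the _cat API; roughly something like this (assumes the node listens on the default port):

    # heap and RAM usage per node, sorted by name
    curl -s "http://localhost:9200/_cat/nodes?v&s=name&h=name,node.role,master,heap.percent,heap.current,heap.max,ram.percent"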

I am not using any kind of monitoring, so I can't be sure if it's going over the limit. We will try to enable monitoring and gather more info.

Also, from this log line, [gc][young][253420][12474] duration [1.2s], collections [1]/[2s], total [1.2s]/[28.4m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [732.8mb]->[2.5mb]/[865.3mb]}{[survivor] [83.9mb]->[89.8mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]},
it looks like GC is not recovering much memory back?

We only have one node right now in this setup.

How do I read these logs? I can't seem to find any doc which explains them properly.

    [2020-03-04T09:15:31,951][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][66471][2453] duration [4.1s], collections [1]/[4.7s], total [4.1s]/[1.5m], memory [3.4gb]->[2.8gb]/[7.7gb], all_pools {[young] [620.8mb]->[4.4mb]/[865.3mb]}{[survivor] [108.1mb]->[79.3mb]/[108.1mb]}{[old] [2.7gb]->[2.7gb]/[6.7gb]}

    [2020-03-04T09:15:31,951][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][66471] overhead, spent [4.1s] collecting in the last [4.7s]

    [2020-03-04T09:17:15,424][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][66572][2459] duration [1.2s], collections [1]/[2s], total [1.2s]/[1.6m], memory [3.5gb]->[2.7gb]/[7.7gb], all_pools {[young] [723.9mb]->[1.4mb]/[865.3mb]}{[survivor] [77.5mb]->[34mb]/[108.1mb]}{[old] [2.7gb]->[2.7gb]/[6.7gb]}

That looks very slow given the heap size and the amount of garbage collected. Do you by any chance have swap enabled? If you are deployed on VMs, do you know if memory ballooning is used?
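
One way to check swap from Elasticsearch itself is the OS section of the node stats (a sketch assuming the node is reachable on the default port; look at the os.swap figures):

    # OS memory and swap as seen by the node
    curl -s "http://localhost:9200/_nodes/stats/os?pretty"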

As it's a Windows machine and runs together with one more application, we can't disable paging. Will adding bootstrap.memory_lock: true help? How can we debug this garbage collection issue further? The issue keeps recurring, hence we wanted to get some guidance.

Also, I am not sure about memory ballooning; we need to check this.

That should hopefully help speed up GC. I do not, however, run Elasticsearch on Windows, so I may not be able to help with other types of tuning.
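
For reference, if you do try memory locking, it goes in elasticsearch.yml and can be verified after a restart; roughly like this (standard setting and endpoint, adjust host/port as needed):

    # elasticsearch.yml
    bootstrap.memory_lock: true

    # after restart: should report "mlockall" : true for the node
    curl -s "http://localhost:9200/_nodes?filter_path=**.mlockall&pretty"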
