Lots of garbage collection logs in Elasticsearch 5.6

Hi, I am seeing a lot of "gc overhead, spent ..." messages in the logs of a single-node Elasticsearch cluster; there is also a log line with total [858ms]/[21.7m]. How do I read these JvmGcMonitorService logs? What does total [858ms]/[21.7m] mean, and what could be the cause of this?

We have about 200 indices, each with 1 shard, and the total heap given to Elasticsearch is 9 GB.

Logs

    INFO   | jvm 1    | 2020/03/02 07:30:39 | [2020-03-02T07:30:39,456][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219234] overhead, spent [664ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:40 | [2020-03-02T07:30:40,472][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219235] overhead, spent [410ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:41 | [2020-03-02T07:30:41,644][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219236] overhead, spent [583ms] collecting in the last [1.1s]
    INFO   | jvm 1    | 2020/03/02 07:30:42 | [2020-03-02T07:30:42,659][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][219237] overhead, spent [321ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 07:30:43 | [2020-03-02T07:30:43,831][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][219238][10908] duration [858ms], collections [1]/[1.1s], total [858ms]/[21.7m], memory [4.7gb]->[4.4gb]/[7.7gb], all_pools {[young] [355.3mb]->[17.1mb]/[865.3mb]}{[survivor] [10.3mb]->[12.2mb]/[108.1mb]}{[old] [4.4gb]->[4.4gb]/[6.7gb]}

This is happening continuously!

    INFO   | jvm 1    | 2020/03/02 17:10:26 | [2020-03-02T17:10:26,023][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253338] overhead, spent [264ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:23 | [2020-03-02T17:11:23,244][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253392] overhead, spent [488ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:36 | [2020-03-02T17:11:36,448][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253405] overhead, spent [530ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:11:52 | [2020-03-02T17:11:52,683][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253420][12474] duration [1.2s], collections [1]/[2s], total [1.2s]/[28.4m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [732.8mb]->[2.5mb]/[865.3mb]}{[survivor] [83.9mb]->[89.8mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:11:52 | [2020-03-02T17:11:52,683][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253420] overhead, spent [1.2s] collecting in the last [2s]
    INFO   | jvm 1    | 2020/03/02 17:12:21 | [2020-03-02T17:12:20,996][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253447][12476] duration [1.4s], collections [1]/[1.9s], total [1.4s]/[28.5m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [709.3mb]->[2mb]/[865.3mb]}{[survivor] [69.3mb]->[108.1mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:12:21 | [2020-03-02T17:12:20,996][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253447] overhead, spent [1.4s] collecting in the last [1.9s]
    INFO   | jvm 1    | 2020/03/02 17:12:33 | [2020-03-02T17:12:33,169][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253459] overhead, spent [328ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:13:03 | [2020-03-02T17:13:03,076][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253488] overhead, spent [551ms] collecting in the last [1.5s]
    INFO   | jvm 1    | 2020/03/02 17:19:06 | [2020-03-02T17:19:06,568][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253846] overhead, spent [433ms] collecting in the last [1.2s]
    INFO   | jvm 1    | 2020/03/02 17:19:20 | [2020-03-02T17:19:20,553][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][253859][12492] duration [1.7s], collections [1]/[1.8s], total [1.7s]/[28.5m], memory [4.4gb]->[3.6gb]/[7.7gb], all_pools {[young] [860.7mb]->[28mb]/[865.3mb]}{[survivor] [74.3mb]->[75.3mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:19:20 | [2020-03-02T17:19:20,553][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253859] overhead, spent [1.7s] collecting in the last [1.8s]
    INFO   | jvm 1    | 2020/03/02 17:21:41 | [2020-03-02T17:21:41,795][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][253997] overhead, spent [311ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:22:41 | [2020-03-02T17:22:41,595][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254056] overhead, spent [600ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:24:11 | [2020-03-02T17:24:11,835][INFO ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254145] overhead, spent [352ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:25:33 | [2020-03-02T17:25:33,603][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254224][12503] duration [1.8s], collections [1]/[2.6s], total [1.8s]/[28.6m], memory [4.3gb]->[3.5gb]/[7.7gb], all_pools {[young] [817.8mb]->[6.3mb]/[865.3mb]}{[survivor] [37.9mb]->[25.9mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:25:33 | [2020-03-02T17:25:33,603][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254224] overhead, spent [1.8s] collecting in the last [2.6s]
    INFO   | jvm 1    | 2020/03/02 17:26:57 | [2020-03-02T17:26:57,935][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254304][12508] duration [4s], collections [1]/[4.1s], total [4s]/[28.7m], memory [4.4gb]->[3.5gb]/[7.7gb], all_pools {[young] [861.8mb]->[15.2mb]/[865.3mb]}{[survivor] [22.2mb]->[5.2mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:26:57 | [2020-03-02T17:26:57,935][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254304] overhead, spent [4s] collecting in the last [4.1s]
    INFO   | jvm 1    | 2020/03/02 17:27:13 | [2020-03-02T17:27:13,435][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254319] overhead, spent [646ms] collecting in the last [1s]
    INFO   | jvm 1    | 2020/03/02 17:29:27 | [2020-03-02T17:29:27,551][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254450][12523] duration [2.2s], collections [1]/[2.3s], total [2.2s]/[28.7m], memory [4.4gb]->[3.6gb]/[7.7gb], all_pools {[young] [825.5mb]->[25.1mb]/[865.3mb]}{[survivor] [40.7mb]->[86.5mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:29:27 | [2020-03-02T17:29:27,551][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][254450] overhead, spent [2.2s] collecting in the last [2.3s]
    INFO   | jvm 1    | 2020/03/02 17:29:36 | [2020-03-02T17:29:36,302][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][254458][12524] duration [1.6s], collections [1]/[1.6s], total [1.6s]/[28.8m], memory [4.5gb]->[3.5gb]/[7.7gb], all_pools {[young] [865.3mb]->[9.6mb]/[865.3mb]}{[survivor] [86.5mb]->[7.4mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]}
    INFO   | jvm 1    | 2020/03/02 17:29:36 | [2020-03-02T17:29:36

How much data / disk is used for the indices on this node?

We have 211 indices and the total disk used is 920 GB. The garbage collection is happening continuously.

My first guess is that fielddata is hitting its throttle point, but that isn't much disk used for that. I don't recall the value for 5.6, but in later versions indices.breaker.fielddata.limit is 40%, so when I see GCs triggering near 40% of heap, this is what I suspect.

You would have to research how much field data you have/need and what the usage patterns are on this node, i.e. search, update, indexing, etc.
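
For example, something along these lines should show how much fielddata is loaded and how close each circuit breaker is to its limit (a minimal sketch using the standard _cat and node-stats endpoints; adjust host/port for your setup):

    # per-node, per-field fielddata cache usage
    curl -s "http://localhost:9200/_cat/fielddata?v"

    # circuit breaker limits, estimated sizes and trip counts
    curl -s "http://localhost:9200/_nodes/stats/breaker?pretty"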

Thank you for the quick reply. Can you please explain what you mean by "fielddata is hitting its throttle point, but that isn't much disk used for that"? By the way, indices.breaker.fielddata.limit defaults to 60% of the JVM heap in ES 5.6.

Also, what do you mean by GCs triggering near 40% of heap? Can fielddata cause lots of garbage collection?

Elastic "rule of thumb" is that a node can handle about 5Tb of data, you are under 1Tb.

Heap is used for different things; one of them is field data. When field data reaches the "breaker" limit, it triggers GC for that type of data.

Back to finding out which Elasticsearch function is driving the heap usage: search, indexing, something else. What does the heap graph look like in Kibana? Is it going over 95% (or whatever total.limit is set to)?

Also, could you share the output of "_cat/nodes?v&s=name" and tell us which node has the issue?
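
If you don't have Monitoring enabled, the same heap numbers can be pulled directly from the _cat API; roughly something like this (assumes the node listens on the default port):

    # heap and RAM usage per node, sorted by name
    curl -s "http://localhost:9200/_cat/nodes?v&s=name&h=name,node.role,master,heap.percent,heap.current,heap.max,ram.percent"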

I am not using any kind of monitoring, so I can't be sure if it's going over the limit. We will try to enable monitoring and gather more info.

Also, from this log line, [gc][young][253420][12474] duration [1.2s], collections [1]/[2s], total [1.2s]/[28.4m], memory [4.3gb]->[3.6gb]/[7.7gb], all_pools {[young] [732.8mb]->[2.5mb]/[865.3mb]}{[survivor] [83.9mb]->[89.8mb]/[108.1mb]}{[old] [3.5gb]->[3.5gb]/[6.7gb]},
it looks like GC is not recovering much memory back?

We only have one node right now in this setup.

How do I read these logs? I can't seem to find any doc which explains them properly.

    [2020-03-04T09:15:31,951][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][66471][2453] duration [4.1s], collections [1]/[4.7s], total [4.1s]/[1.5m], memory [3.4gb]->[2.8gb]/[7.7gb], all_pools {[young] [620.8mb]->[4.4mb]/[865.3mb]}{[survivor] [108.1mb]->[79.3mb]/[108.1mb]}{[old] [2.7gb]->[2.7gb]/[6.7gb]}

    [2020-03-04T09:15:31,951][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][66471] overhead, spent [4.1s] collecting in the last [4.7s]

    [2020-03-04T09:17:15,424][WARN ][o.e.m.j.JvmGcMonitorService] [es_master_essvr16] [gc][young][66572][2459] duration [1.2s], collections [1]/[2s], total [1.2s]/[1.6m], memory [3.5gb]->[2.7gb]/[7.7gb], all_pools {[young] [723.9mb]->[1.4mb]/[865.3mb]}{[survivor] [77.5mb]->[34mb]/[108.1mb]}{[old] [2.7gb]->[2.7gb]/[6.7gb]}

That looks very slow given the heap size and the amount of garbage collected. Do you by any chance have swap enabled? If you are deployed on VMs, do you know if memory ballooning is used?
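
One way to check swap from Elasticsearch itself is the OS section of the node stats (a sketch assuming the node is reachable on the default port; look at the os.swap figures):

    # OS memory and swap as seen by the node
    curl -s "http://localhost:9200/_nodes/stats/os?pretty"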

As it's a Windows machine and runs together with one more application, we can't disable paging. Will adding bootstrap.memory_lock: true help? How can we debug this garbage collection issue further? The issue keeps recurring, hence we wanted to get some guidance.

Also, I am not sure about memory ballooning; we need to check this.

That should hopefully help speed up GC. I do not, however, run Elasticsearch on Windows, so I may not be able to help with other types of tuning.
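
For reference, if you do try memory locking, it goes in elasticsearch.yml and can be verified after a restart; roughly like this (standard setting and endpoint, adjust host/port as needed):

    # elasticsearch.yml
    bootstrap.memory_lock: true

    # after restart: should report "mlockall" : true for the node
    curl -s "http://localhost:9200/_nodes?filter_path=**.mlockall&pretty"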
