ES 6.x: Frequent and Long Garbage Collection

I'm having a variety of performance issues w/ ES 6.1.1 since upgrading from ES 1.7.5 and I'm trying to break them down and see what I can shake out in terms of a primary cause.

ES 6.1.1
NEST 6.0.1
3-node cluster: 34 indexes, ~1200 shards, ~30M docs, ~98GB total
Windows VMs (56GB of memory w/ 24GB for ES, 8 cores, hundreds of GB of available disk)

Frequent Garbage Collection
I see the GC running often and, as far as I can tell, needlessly. With a 24GB heap, usage still rarely exceeds 12GB, and in quieter times it hangs around 4-8GB or less. Also, at one point the GC spends 25s collecting. Here's a sample:

[2018-04-10T13:35:51,628][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550827] overhead, spent [515ms] collecting in the last [1s]
[2018-04-10T13:35:52,784][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550828] overhead, spent [679ms] collecting in the last [1.1s]
[2018-04-10T13:35:53,971][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550829] overhead, spent [643ms] collecting in the last [1.1s]
[2018-04-10T13:35:55,081][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550830] overhead, spent [723ms] collecting in the last [1.1s]
[2018-04-10T13:35:56,378][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550831] overhead, spent [837ms] collecting in the last [1.2s]
[2018-04-10T13:35:57,393][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550832] overhead, spent [567ms] collecting in the last [1s]
[2018-04-10T13:36:07,690][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550842] overhead, spent [618ms] collecting in the last [1s]
[2018-04-10T13:36:08,706][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550843] overhead, spent [647ms] collecting in the last [1s]
[2018-04-10T13:36:09,909][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550844] overhead, spent [572ms] collecting in the last [1.2s]
[2018-04-10T13:36:10,925][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550845] overhead, spent [420ms] collecting in the last [1s]
[2018-04-10T13:36:13,018][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550846] overhead, spent [958ms] collecting in the last [1.1s]
[2018-04-10T13:36:14,034][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550847] overhead, spent [527ms] collecting in the last [1.9s]
[2018-04-10T13:36:15,034][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550848] overhead, spent [465ms] collecting in the last [1s]
[2018-04-10T13:36:16,050][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550849] overhead, spent [565ms] collecting in the last [1s]
[2018-04-10T13:36:17,065][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550850] overhead, spent [650ms] collecting in the last [1s]
[2018-04-10T13:36:18,112][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550851] overhead, spent [514ms] collecting in the last [1s]
[2018-04-10T13:36:19,128][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550852] overhead, spent [684ms] collecting in the last [1s]
[2018-04-10T13:36:20,690][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550853] overhead, spent [910ms] collecting in the last [1.5s]
[2018-04-10T13:36:21,706][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550854] overhead, spent [552ms] collecting in the last [1s]
[2018-04-10T13:36:22,722][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550855] overhead, spent [472ms] collecting in the last [1s]
[2018-04-10T13:36:23,815][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550856] overhead, spent [622ms] collecting in the last [1s]
[2018-04-10T13:36:24,831][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550857] overhead, spent [514ms] collecting in the last [1s]
[2018-04-10T13:36:25,847][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550858] overhead, spent [576ms] collecting in the last [1s]
[2018-04-10T13:36:26,862][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550859] overhead, spent [627ms] collecting in the last [1s]
[2018-04-10T13:36:27,909][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550860] overhead, spent [629ms] collecting in the last [1s]
[2018-04-10T13:36:28,925][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550861] overhead, spent [563ms] collecting in the last [1s]
[2018-04-10T13:36:30,081][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550862] overhead, spent [640ms] collecting in the last [1.1s]
[2018-04-10T13:36:31,144][INFO ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550863] overhead, spent [520ms] collecting in the last [1s]
[2018-04-10T13:36:32,425][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550864] overhead, spent [752ms] collecting in the last [1.2s]
[2018-04-10T13:36:33,487][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550865] overhead, spent [541ms] collecting in the last [1s]
[2018-04-10T13:36:58,644][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][old][550866][76] duration [25.1s], collections [2]/[25.1s], total [25.1s]/[57.9s], memory [23.5gb]->[2.8gb]/[23.9gb], all_pools {[young] [436mb]->[105.4mb]/[532.5mb]}{[survivor] [66.5mb]->[0b]/[66.5mb]}{[old] [23gb]->[2.7gb]/[23.3gb]}
[2018-04-10T13:36:58,644][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550866] overhead, spent [25.1s] collecting in the last [25.1s]
[2018-04-10T13:36:59,785][WARN ][o.e.m.j.JvmGcMonitorService] [ES2-QUERY] [gc][550867] overhead, spent [592ms] collecting in the last [1.1s]

This doesn't look normal to me. What should I be looking for in terms of a cause for this behavior?
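A quick way to see the kind of heap numbers I'm describing, in case anyone wants to compare, is the _cat/nodes API; something like this (adjust the host, and the columns are just the ones I happen to watch):

curl -s "localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max"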


Same problem as I'm seeing: Excessive garbage collection

Yeah, I noticed yours was posted almost simultaneously with mine. :grinning: As I read through yours, though, it looked a little different: in my case I'm actually getting tons of memory back during the "stop the world" collections, whereas you reported not getting anything back during yours (check your logs for [gc][old] messages). For me, though, the world is being stopped for up to 25s, which, under a heavy load of search requests from our web app users, is going to contribute to rejected requests, etc. Since I wrote this question, I found a good article that discusses memory management and is very germane to our issues: Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide.

One important point that the author makes:

The basic thing to understand with heap allocation is: the more memory you give it, the more time Java spends garbage collecting.

This definitely seems to apply to my case. I've given my nodes TONS of memory to work with, but when it comes time to GC, I think that may be working against me. So, given that "stop the world" pauses can be a problem, the author mentions a Java 8 GC feature:

Java 8 brings a brand new garbage collector called Garbage First, or G1, designed for heaps greater than 4GB. G1 uses background threads to divide the heap into 1 to 32MB regions, then scan the regions that contain the most garbage objects first. Elasticsearch and Lucene does not recommend using G1GC for many reasons, one of them being a nasty bug on 32bits JVM that might lead to data corruption. From an operational point of view, switching to G1GC was miraculous, leading to no more stop the world and only a few memory management issues.

So, I'm now wondering if the ES team has any additional guidance on this, given that it seemed to work so well for the author of this article. Hopefully someone in the forum community can weigh in here.
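For context, the change the article describes amounts to swapping the collector flags in config/jvm.options, roughly like this (I have not tried it myself; the commented-out lines are the stock CMS settings as I understand them, and this is shown only to make clear what I'm asking about):

# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseG1GC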


As I mentioned in the other thread, I run logstash with G1 (and string deduplication) enabled. I only process a couple of thousand events per second, and I run with an initial heap of 256MB and a max heap of 512MB. I haven't bothered to tweak G1 much. I get way more mixed GC cycles than I should, which I should probably tune :slight_smile:
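Concretely, that setup boils down to something like this in logstash's config/jvm.options (approximate, from memory; note the string deduplication flag only does anything under G1):

-Xms256m
-Xmx512m
-XX:+UseG1GC
-XX:+UseStringDeduplication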

I ingest the verbose GC logs using filebeat and run them through a logstash pipeline that parses them, so that I can visualize them using Timelion.
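The parsing side is nothing exotic. The verbose GC log format is the fiddly part; by contrast, even the JvmGcMonitorService overhead lines in the Elasticsearch log above would grok out with something along these lines (an untested sketch, and the field names are just my own):

filter {
  grok {
    match => {
      "message" => "\[%{TIMESTAMP_ISO8601:ts}\]\[%{LOGLEVEL:level}\s*\]\[%{DATA:logger}\s*\] \[%{DATA:node}\] \[gc\]\[%{NUMBER:gc_seq}\] overhead, spent \[%{DATA:gc_spent}\] collecting in the last \[%{DATA:gc_window}\]"
    }
  }
}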

The cost of GC in ParNew/CMS depends on the number of GC roots. That includes objects in the old generation, so the larger the old generation, the more young generation GCs cost. If you only need 8 GB, letting the heap grow to 18 GB can make performance worse, not better. A lot of people will refuse to believe that (at least in the company I work for), but it is demonstrably true if you do the experiments and look at the data.

I find the default GC options in logstash quite bizarre. They suggest -XX:+PrintTenuringDistribution, which I have very rarely found useful (trust me, you are in a very bad place if you have to adjust tenuring thresholds yourself), yet do not suggest -XX:+PrintHeapAtGC, which I consider absolutely essential when using CMS, since CMS does not have Ergonomics. :confused:
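For reference, the GC logging I have in mind is just the classic Java 8 flags in jvm.options, roughly like this (the log path and rotation numbers are placeholders, not recommendations):

-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintHeapAtGC
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=8
-XX:GCLogFileSize=64m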

Wow, that's a pretty cool trick (i.e. your ingestion of the GC logs); you're well ahead of me on that front.

Your point about only configuring as much heap as you need is well taken. I've been reading A Heap of Trouble and seeing strong evidence that "too big" a heap might be having a significant adverse impact on my cluster. That said, determining how much heap I really need is the trick. We bumped up our VMs to get some more cores for processing requests after we were seeing an increased number of rejected requests (bulk and search). At the same time, we decided to up the heap size, even though, from everything I could see before, we were fine w/ ~12GB of heap. Given all the stuff I'm reading and your feedback, I think I'm going to scale my heap allocation back and see how that impacts performance.
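Concretely, the plan is just to dial -Xms/-Xmx back down in config/jvm.options on each node, e.g. (12g is a guess based on what we ran comfortably before the bump, not a recommendation):

-Xms12g
-Xmx12g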

BTW, I just got to this section in "A Heap of Trouble" on G1:

The JVM engineers have developed a concurrent garbage collector known as G1 GC that was first supported starting in JDK 7u4 and is set to be the default collector starting in JDK 9. This collector divides the heap into regions and is designed to first collect regions that are mostly garbage (hence G1: garbage first). This collector still pauses application threads when collecting, but the idea is that by focusing on regions with the most garbage, these collections will be highly efficient so that application threads need to be paused only briefly. This enables G1 GC to operate on large heaps with predictable pause times. This is exactly what we want! Unfortunately, G1 GC has exhibited bugs that lead to corruption in Lucene indices. While the older builds appear to be the ones most impacted by these issues, scary bugs are still being found even in the latest builds. We test Elasticsearch against G1 GC continuously on our CI infrastructure but at this time we recommend against and do not support running Elasticsearch with the G1 collector.

As such, I guess we'll avoid it for now but may come back to it and try it in our dev environments.
