Miracle G1 settings for 30GB heaps

Ok, maybe not miracle, but it made you look. :smile:

I'm running this version of Java:

java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

I have 30GB heaps on 64GB servers with 16 cores, and a RAID 0 stripe across four 4TB SATA 7200 RPM disks.
I'm indexing a consistent 30k events/sec via bulk inserts; IOPS on the disks range from 100-300.

I was having very frequent young generation GC pauses, even when my heap was only about 30% used. They lasted anywhere from 1-4 seconds and happened often enough to really affect indexing throughput, so I started looking.

I know that in general, when it comes to heap tuning, it's better to just not: the VM does a very good job in most cases. In my searching I came across the blog linked below, tried the settings it suggests, and since enabling them I have not seen a young GC DEBUG/INFO/WARN in my ES logs at all. It's been just over 48 hours now.

Maybe it's too early to get this excited, but I wanted to share the settings and get some comments, and hopefully they will help someone else who might be struggling with this as well.

I should also mention that my cluster is made of 13 nodes, with 3 dedicated masters, 4 dedicated clients, and 6 dedicated data nodes.

Here are the settings I'm using:

/usr/bin/java -Xms30g -Xmx30g -Xss256k 
-Djava.awt.headless=true -server 
-XX:+UseCompressedOops 
-XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 
-XX:+DisableExplicitGC 
-verbose:gc 
-Xloggc:/var/log/elasticsearch/gc.log 
-XX:+PrintGCDetails 
-XX:+PrintGCDateStamps 
-XX:+HeapDumpOnOutOfMemoryError (snip)
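
In case it saves someone some digging: rather than editing the launch command directly, one way to apply the same settings is through the service environment. This is just a sketch; it assumes the stock RPM packaging that reads /etc/sysconfig/elasticsearch, so adjust paths and variable names for your install.

# /etc/sysconfig/elasticsearch (sketch)
ES_HEAP_SIZE=30g    # becomes -Xms30g -Xmx30g
ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC"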

The only one I left out was the -XX:G1NewSizePercent=3 parameter, because apparently it's not valid on my JVM, and the VM complained (it still started, though).
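
If you do want to try it on a JVM where the flag exists, note that G1NewSizePercent is an experimental HotSpot option and has to be unlocked first. A sketch, with 3 just being the value the blog suggests:

-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=3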

Anyway, enough rambling. Check this out and let me know what you think (yes, I know it says it's for HBase :slight_smile: )

https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase

Hope it helps.
Chris

If you use G1GC you risk data loss, which is why we don't support it.

That may change, but this is the current state of play.

Thanks Mark.

Could you elaborate on the scenario where that might happen? Or link me to it?

Thank you sir.
Chris

http://wiki.apache.org/lucene-java/JavaBugs

Do not, under any circumstances, run Lucene with the G1 garbage collector. Lucene's test suite fails with the G1 garbage collector on a regular basis, including bugs that cause index corruption. There is no person on this planet that seems to understand such bugs (see JDK-8038348, open for over a year), so don't count on the situation changing soon. This information is not out of date, and don't think that the next Oracle Java release will fix the situation.


Thanks.

Well, crap. Guess CMS is still the way to go. Any miracle CMS settings to help me get started, or just the basics?

Chris
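
Edit: for anyone landing here later, a reasonable CMS starting point is simply what Elasticsearch's stock startup script already sets. A sketch of those flags (verify against your own bin/elasticsearch.in.sh):

-XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly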

Interesting, though. Reading through the bug history, I found this:

Hi everyone,

I am a committer to the Lucene/Solr project. We've recently hit what
we believe is a JIT/GC bug -- it manifests itself only when G1GC is
used, on a 32-bit VM:

Using Java: 32bit/jdk1.8.0-ea-b102 -server -XX:+UseG1GC
Java: 32bit/jdk1.7.0_25 -server -XX:+UseG1GC

and later:

and are consistent before and after. jdk1.7.0_04, 64-bit does NOT
exhibit the issue (and neither does any version afterwards, it only
happens on 32-bit; perhaps it's because of smaller number of available
registers and the need to spill?).

This is very specific to the 32-bit VM, which I am not using. No argument that G1GC is not officially supported, but perhaps I'm at less risk of index issues than we thought?

Chris

That's a risk evaluation that you need to run.

We, obviously, want our customers to avoid data loss or corruption as much as possible, hence our position on this.

Understood. :slight_smile: Thanks for all the replies.

Don't panic. If you test for yourself, you can be optimistic.

Lucene committer Uwe Schindler's latest comments on Lucene and G1 can be found here: blog - devmio - Software Know-How

When observing the Lucene builds during recent months, the Lucene team noticed that the errors initially seen no longer occurred. This is also consistent with the statement by Oracle that G1GC is “ready for production” in Java 8 Update 40.

I recommend Java 8u40+, 64 bit, Red Hat Linux (no VM), and G1. Never saw a single crash or data loss because of G1 with that combination.

Thanks for the info Jorg.

We're working to move to Java 8 at the moment, so I'll make sure we go for at least that update.

Chris

G1GC caused extremely high CPU load on our systems, even though there were no requests on the servers. Never, ever use G1GC with Elasticsearch.

I hope the problem will be gone in Java 9.


Some interesting recent development:

The specific bug (JDK-8038348) referenced on https://wiki.apache.org/lucene-java/JavaBugs next to the "Do not, under any circumstances, run Lucene with the G1 garbage collector" remark has now been resolved, but nothing has changed in the official documentation, and now G1 is set to be the default GC in Java 9.

Please note that the specific corruption issue that you reference was not, and is not, the only reason to avoid the use of G1 with Elasticsearch. It's just that the corruption issue was a clear reason to never even consider using G1 with Elasticsearch; any additional discussion was moot until that point was resolved. Issues remain, though:

  • the performance tax going from CMS to G1 is quite significant due to the use of more expensive write barriers; this has a substantial impact on throughput
  • G1 has a larger footprint due to its remembered sets and collection set

Here is one example of the impact G1 GC had on a cluster: Indexing performance degrading over time

CMS is well-understood, stable and very mature at this point. Switching to G1 GC carries a heavy cost with no apparent benefits.

You are right to point out that G1 will be the default in JDK 9, but that does not mean that it's ready for primetime. In fact, there is plenty of concern in the community that this is not the case.


The motivation for G1 is low-pause garbage collection, achieved by avoiding stop-the-world phases. Stop-the-world phases can take seconds up to minutes with the CMS GC. As always there are tradeoffs, and so it is with G1.

The pros of G1 GC are

  • low-pause garbage collection, no stop-the-world phases
  • low pauses also possible on large heaps (>8 GB)
  • will be supported by Oracle in Java 9+

The cons are

  • less throughput
  • it requires extra CPU cycles and is best run on multi-core CPUs
  • it does not run perfectly out of the box; it requires intimate GC configuration knowledge (e.g. -XX:MaxGCPauseMillis; see the sketch after this list)
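
To give a feel for the kind of knobs involved, here is a sketch of a G1 starting point for a large heap. The values are illustrative only and need validating against your own workload:

-XX:+UseG1GC 
-XX:MaxGCPauseMillis=200                 # pause-time goal; G1 sizes the young gen to try to meet it
-XX:G1HeapRegionSize=16m                 # region size; allocations over half a region are "humongous"
-XX:InitiatingHeapOccupancyPercent=35    # start concurrent marking earlier than the default of 45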

There had been concerns in the community, but as JEP 248 (Make G1 the Default Garbage Collector) states:

If a critical issue is found that can't be addressed in the JDK 9 time frame, we will revert back to use Parallel GC as the default for the JDK 9 GA.

The truth is that Oracle has turned away from CMS; see JEP 291: Deprecate the Concurrent Mark Sweep (CMS) Garbage Collector.

To complete the picture, there has been a meeting of engineers from Google/Oracle/Twitter/SAP/jClarity on steps to save CMS from becoming unsupported or removed from the codebase: https://bugs.openjdk.java.net/secure/attachment/64150/cms-meeting-20-sep-2016.html

If anyone wants to rely on CMS during the whole JDK 9 lifecycle and beyond, let's hope that Google/SAP/Twitter/jClarity & Co. can establish methods to keep a supported and improved CMS in OpenJDK 9+.

I'll keep my fingers crossed, because Google has patched CMS in their Java version to work around a severe CMS performance bug, and they are holding back other goodies as well (see JEP 291: Deprecate the Concurrent Mark Sweep (CMS) Garbage Collector).

Working on the issue of long GC pauses is a hard challenge, and G1 is not the only project to tackle it. Another GC project is Shenandoah, which targets even larger heaps of hundreds of GB, but it will not be included in JDK 9.

Issues outside the JVM can also destroy GC performance, no matter which GC is running: for example, /tmp mounted on a non-tmpfs or un-tuned filesystem can cause Linux to block I/O for milliseconds whenever the JVM writes its performance stats to the hsperfdata files (mtime updates).
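
If you suspect you're hitting that one, the usual mitigation is to disable the shared hsperfdata mapping. A sketch (note that this also blinds tools such as jps and jstat, which read those files):

-XX:+PerfDisableSharedMem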

With regard to configuration findings for Elasticsearch, trust your own infrastructure, not other environments. Be open for all kinds of issues from whatever source they may come. Take metrics under your workload on your machines, measure latency, throughput, request/response times, and choose wisely.
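
For the GC side of those measurements, a sketch of logging flags that make pause behavior visible enough to compare collectors under a real workload (these names are for Java 8 and earlier; Java 9+ replaces them with unified logging via -Xlog:gc*):

-Xloggc:/var/log/elasticsearch/gc.log 
-XX:+PrintGCDetails 
-XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime    # total stopped time, including non-GC safepoints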

Thanks for the thoughtful reply @jprante.

Note that I said no "apparent" benefits, not no "claimed" benefits. Yes, the main claimed benefit of G1 is predictable garbage collection pauses. The problem is that to achieve that there is a substantial drop in throughput. For Elasticsearch, this drop in throughput translates into lower indexing rates, and higher latency serving requests. Thus, the trade here is predictable pause times for an overall worse situation; hence, no apparent benefits.

Sadly, G1 GC is a whole lot more complicated to tune than just the single knob -XX:MaxGCPauseMillis. For example, there is the complexity of large object allocations, a potential issue for clusters executing bulk requests with large payloads.
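
To make that concrete: in G1, any allocation larger than half the region size is treated as "humongous" and handled on a slower path, so big bulk-request buffers can trip it. One common workaround is raising the region size so those buffers stay under the threshold. A sketch, with the value purely illustrative:

-XX:+UseG1GC -XX:G1HeapRegionSize=32m    # with 32m regions, only allocations over 16m are humongous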