Using G1 with Elasticsearch

Background

Oracle introduced the Garbage First (G1) Garbage Collector (G1GC) in Java 7 as an alternative to the Concurrent Mark-Sweep (CMS) Garbage Collector, which Elasticsearch uses by default. G1 is intended for server-side applications running on the JVM, like Elasticsearch, that tend to run with large heaps.

When CMS was written, few applications used anywhere near as much Java heap space as many applications now take for granted. As a result, CMS often cannot cope with the large heaps of modern applications. This can result in long GC-related "stop the world" pauses that, in the case of Elasticsearch with very large heaps, could cause a node to drop in and out of a cluster.

For any application in this scenario, the long pauses can become a problem in production, when it is too late to recover. This is exactly the problem that G1GC hopes to solve.

Oracle has supported G1 since Java 7 update 4 and continues to support it as a non-default garbage collector for server applications with large heaps. In short, G1 was designed to avoid the long pauses (or "interruptions", as Oracle's description of G1GC calls them) that currently occur with the default garbage collector.
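To make the discussion concrete, it can help to see which collector a given JVM is actually running and how much time it has spent collecting. The following is a minimal sketch using the standard `java.lang.management` API; the class name `GcReport` and the method `describeCollectors` are our own illustrative names, not part of Elasticsearch. Note that the reported time is the accumulated collection time per collector, which is only a rough proxy for "stop the world" pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of Elasticsearch.
public class GcReport {

    /** Returns one summary line per garbage collector registered in this JVM. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() and getCollectionTime() return -1 when the
            // collector does not report that value.
            lines.add(String.format("%s: collections=%d, totalTimeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

The collector names in the output depend on the JVM and on the flags it was started with. For example, starting the JVM with `-XX:+UseG1GC` should report G1 collectors, while `-XX:+UseConcMarkSweepGC` should report the CMS ones.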

So why not switch to G1?

Unfortunately, the story is not as one-sided as it seems. While Oracle has stated that the future of G1 is to replace CMS, they have not yet made the switch.

At Elasticsearch, we believe that G1 is not the default garbage collector because it is not ready for production use. We do not take this position lightly; we have taken it because of what we see in Elasticsearch's test framework as well as Apache Lucene's test framework.

Both projects are built separately and continuously against multiple versions of Java to ensure compatibility. Critically, both projects run some of their builds with G1 as the enabled garbage collector. We use these builds as a barometer to determine whether we can claim to support that variant.

Example Issues

Through our build environments, we have witnessed multiple hard-to-repeat issues that only appear when running G1GC, including:

  • Segmentation Faults
  • Unexpected Hangs
  • Unexpected Failures (this one actually reproduced)

OpenJDK Bugs reported via the above testing framework:

JDK release notes continue to list many ominous G1-related fixes (search for "G1"):

Raw JDK 8 Changesets:

As with any software, not all issues are created equal, but there are still too many for us to safely endorse G1.

Do not use G1 yet, but maybe soon

We are very excited about the prospect of G1, and we are even aware of multiple clusters that run successfully with G1. Even so, we still cannot safely condone using G1, because of the failures that we are aware of.

As new Java updates are released, we constantly revisit our position on G1, and we are excited to notice less instability with each release. For that reason, we look forward to the day that we can suggest that our users switch to G1. That is just not today.

Has anyone played with Elasticsearch 1.6, JDK 1.8, and G1? Is it still not recommended?

What is Elastic's current position on using the G1 collector?

Have new tests been run?

How many times have you run your successful tests?

If my cluster configuration and environment are the same as those of your successful clusters, why can I not use G1?