G1GC In Production with Regard to Consistency

Hello,

We are having massive problems with our ES cluster during indexing peaks (~2 GB/s): they force a stop-the-world GC of around 40-60 seconds on each node once every 1-2 minutes. This blocks search operations and causes timeouts on the client side. We considered two options: dedicated indexing nodes or switching to G1GC.

We went with G1GC as a first test because even with dedicated indexing nodes we would still have ES spending 50% of its time on GC, which seemed an inefficient use of CPU time.
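As a rough sanity check on that figure, the share of wall-clock time lost to pauses can be estimated from the pause length and how often a pause starts. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes "once every 1-2 minutes" means one pause starts per interval.

```python
def gc_time_fraction(pause_seconds, interval_seconds):
    """Estimate the fraction of wall-clock time spent in GC pauses.

    Assumption: exactly one pause of `pause_seconds` starts every
    `interval_seconds`. The result is capped at 1.0 because a node
    cannot be paused for more than 100% of the time.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(pause_seconds / interval_seconds, 1.0)


# Best and worst case from the numbers in the post:
best = gc_time_fraction(40, 120)   # shortest pause, longest interval
worst = gc_time_fraction(60, 60)   # longest pause, shortest interval
```

Under that assumption the numbers above put the node somewhere between about a third of its time and essentially all of its time in GC, so 50% is a plausible middle estimate.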

Our tests with G1GC on a replicated environment were all successful: nothing crashed, no GC pause was longer than 300 ms, and there were no functional problems as far as we could tell.

However, I read in an article on Lucene that the reason they recommend against G1GC is random failures of index consistency. I was wondering whether this would be noticeable for us via an exception, or whether it can mean silently corrupt data that we never notice. Does anyone have experience or ideas on this?

So do we have to run extensive tests, comparing the actual index content bit by bit (OK, doc by doc) and running the same aggregations against both to make sure the results are consistent? Or, if no exceptions are thrown from ES / Lucene, can we assume the consistency is OK?

So far we haven't seen any inconsistencies, and at least the query results are exactly the same.
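The doc-by-doc comparison described above could be sketched roughly as follows. This is a minimal illustration, not an ES API: `doc_fingerprint` and `compare_indices` are hypothetical names, and the input is assumed to be plain `(doc_id, source_dict)` pairs that have already been exported from each cluster by whatever means is available.

```python
import hashlib
import json


def doc_fingerprint(source):
    """Hash a document body so that key order does not matter."""
    canonical = json.dumps(source, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_indices(docs_a, docs_b):
    """Compare two iterables of (doc_id, source_dict) pairs.

    Returns the ids present on only one side and the ids whose
    bodies differ between the two sides.
    """
    prints_a = {doc_id: doc_fingerprint(src) for doc_id, src in docs_a}
    prints_b = {doc_id: doc_fingerprint(src) for doc_id, src in docs_b}
    return {
        "only_in_a": sorted(prints_a.keys() - prints_b.keys()),
        "only_in_b": sorted(prints_b.keys() - prints_a.keys()),
        "different": sorted(
            doc_id
            for doc_id in prints_a.keys() & prints_b.keys()
            if prints_a[doc_id] != prints_b[doc_id]
        ),
    }


# Usage with two tiny in-memory "indices":
cms_docs = [("1", {"x": 1, "y": 2}), ("2", {"x": 3})]
g1_docs = [("1", {"y": 2, "x": 1}), ("2", {"x": 4})]
report = compare_indices(cms_docs, g1_docs)
```

Note that a check like this only covers the stored document bodies. It would not detect damage in the index structures that searches and aggregations actually read, which is why running the same queries and aggregations against both clusters remains a separate check.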

Details on our setup:
  • ES 2.3.1
  • 60GB memory, 30GB heap
  • Oracle 1.8.0_77
    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=10 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1HeapRegionSize=32M

Thanks,
Elmar

Please take a look at our experience with G1GC (link) and a related issue on the same topic ([link] (Indexing performance degrading over time)). What we saw was not data corruption but another strange side effect of this GC. In our case, we switched back to CMS and the problem with overloaded nodes went away.

We are still seeing data corruption/loss with G1GC and currently neither recommend nor support its use.

Thanks for the feedback. When you say "seeing data corruption/loss", does this happen quietly, without any errors in the log, or is it noticeable through appropriate messages when it happens?

https://bugs.openjdk.java.net/browse/JDK-8148175 is one example.