We are having massive problems with our ES cluster during high indexing peaks (~2 GB/s): it forces a stop-the-world GC of around 40-60 seconds on each node once every 1-2 minutes, which blocks search operations and causes timeouts on the client side. We considered dedicated index nodes versus switching to G1GC.
We went with G1GC as a first test, because even with dedicated index nodes we would still have ES spending 50% of its time on GC, which seemed an inefficient use of CPU time.
Our tests with G1GC in a replicated environment were all successful: nothing crashed, no GC pause longer than 300 ms, and no functional problems as far as we could discover.
However, I read in an article on Lucene that the reason they recommend against G1GC is random index-consistency failures. I was wondering whether such failures would be noticeable to us via an exception, or whether they could mean silently corrupted data without us ever noticing. Does anyone have experience with or ideas on this?
So do we have to run extensive tests comparing actual index content bit by bit (OK, doc by doc) and running the same aggregations against both clusters to make sure the results are consistent? Or, as long as no exceptions are thrown by ES / Lucene, can we assume consistency is OK?
So far we haven't seen any inconsistencies, and at least query results are exactly the same.
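In case it helps, the doc-by-doc check we have in mind is roughly this sketch. The `diff_indices` helper and the `(_id, _source)` pairs are hypothetical; in practice the pairs would come from scrolling the same index on both clusters:

```python
import hashlib
import json


def doc_fingerprint(source):
    """Hash a document's _source. Canonical JSON (sorted keys) so that
    field order does not affect the fingerprint."""
    canonical = json.dumps(source, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def diff_indices(docs_a, docs_b):
    """Compare two iterables of (_id, _source) pairs from two clusters.

    Returns (only_in_a, only_in_b, changed): ids missing on either side,
    plus ids present on both sides whose _source fingerprints differ.
    """
    a = {doc_id: doc_fingerprint(src) for doc_id, src in docs_a}
    b = {doc_id: doc_fingerprint(src) for doc_id, src in docs_b}
    only_in_a = set(a) - set(b)
    only_in_b = set(b) - set(a)
    changed = {i for i in set(a) & set(b) if a[i] != b[i]}
    return only_in_a, only_in_b, changed
```

This only checks stored `_source` content, of course; it says nothing about the inverted index itself, which is why we'd also run identical aggregations against both clusters.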
Details of our setup:
- ES 2.3.1
- 60 GB memory, 30 GB heap
- Oracle JDK 1.8.0_77
- JVM flags: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=10 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1HeapRegionSize=32M
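For anyone reproducing the setup: assuming the stock ES 2.x startup scripts, the heap and GC flags can be set through the environment variables those scripts read (ES_HEAP_SIZE for -Xms/-Xmx, ES_JAVA_OPTS for extra JVM flags) before starting each node:

```shell
# Assumption: using the standard ES 2.x startup scripts, which pick up
# ES_HEAP_SIZE (sets -Xms and -Xmx) and ES_JAVA_OPTS from the environment.
export ES_HEAP_SIZE=30g
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -XX:InitiatingHeapOccupancyPercent=10 \
  -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 \
  -XX:G1HeapRegionSize=32M"
```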