We are having massive problems with our ES cluster during high indexing peaks (~2 GB/s): it forces a stop-the-world GC of around 40-60 seconds on each node once every 1-2 minutes, which blocks search operations and causes timeouts on the client side. We considered dedicated index nodes versus switching to G1GC.
We went with G1GC as a first test, because even with dedicated index nodes we would still have ES spending 50% of its time on GC, which seemed an inefficient use of CPU time.
Our tests with G1GC in a replicated environment were all successful: nothing crashed, no GC pause longer than 300 ms, and no functional problems as far as we could discover.
However, I read in an article on Lucene that the reason they recommend against G1GC is random index-consistency failures. I was wondering whether such failures would be noticeable to us via an exception, or whether they could mean silently corrupted data without us ever noticing. Does anyone have experience with or ideas on this?
So do we have to run extensive tests comparing actual index content bit by bit (OK, doc by doc) and running the same aggregations against both clusters to make sure the results are consistent? Or, as long as no exceptions are thrown by ES / Lucene, can we assume consistency is OK?
So far we haven't seen any inconsistencies, and at least query results are exactly the same.
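In case it helps, the doc-by-doc check we have in mind is roughly this sketch. The `diff_indices` helper and the `(_id, _source)` pairs are hypothetical; in practice the pairs would come from scrolling the same index on both clusters:

```python
import hashlib
import json


def doc_fingerprint(source):
    """Hash a document's _source. Canonical JSON (sorted keys) so that
    field order does not affect the fingerprint."""
    canonical = json.dumps(source, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def diff_indices(docs_a, docs_b):
    """Compare two iterables of (_id, _source) pairs from two clusters.

    Returns (only_in_a, only_in_b, changed): ids missing on either side,
    plus ids present on both sides whose _source fingerprints differ.
    """
    a = {doc_id: doc_fingerprint(src) for doc_id, src in docs_a}
    b = {doc_id: doc_fingerprint(src) for doc_id, src in docs_b}
    only_in_a = set(a) - set(b)
    only_in_b = set(b) - set(a)
    changed = {i for i in set(a) & set(b) if a[i] != b[i]}
    return only_in_a, only_in_b, changed
```

This only checks stored `_source` content, of course; it says nothing about the inverted index itself, which is why we'd also run identical aggregations against both clusters.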
Details of our setup:
- ES 2.3.1
- 60 GB memory, 30 GB heap
- Oracle JDK 1.8.0_77
- JVM flags: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=10 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1HeapRegionSize=32M
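For anyone reproducing the setup: assuming the stock ES 2.x startup scripts, the heap and GC flags can be set through the environment variables those scripts read (ES_HEAP_SIZE for -Xms/-Xmx, ES_JAVA_OPTS for extra JVM flags) before starting each node:

```shell
# Assumption: using the standard ES 2.x startup scripts, which pick up
# ES_HEAP_SIZE (sets -Xms and -Xmx) and ES_JAVA_OPTS from the environment.
export ES_HEAP_SIZE=30g
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -XX:InitiatingHeapOccupancyPercent=10 \
  -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 \
  -XX:G1HeapRegionSize=32M"
```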