Hi all,
We constantly get circuit breaker errors and GC overhead logs in clusters under high indexing load. Looking at the Kibana dashboards and analyzing the logs, it is clear that GC is one of our main bottlenecks. We tried to tune the default GC config, but without success.
So we'd like to hear hints from anyone who has run into similar issues.
We are using Elasticsearch 7.10.2 running on openjdk version "15.0.1".
By default, Elasticsearch uses G1GC for this JVM version. The only JVM arg we manually set is the heap size, which is -Xms31232m -Xmx31232m.
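In case the mechanics matter, the heap override itself is just a small file dropped into jvm.options.d (the path and file name below are only how we happen to organise it, nothing Elasticsearch requires):
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms31232m
-Xmx31232m
Everything else (UseG1GC, G1ReservePercent, InitiatingHeapOccupancyPercent) comes from the stock jvm.options shipped with 7.10.2.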
Data node details:
- Number of data nodes: 3
- Heap: ~30.5gb (-Xms31232m -Xmx31232m)
- Overall memory: 64gb
- CPUs: 16
- 86 indices
- 594 shards
- 1.5M documents
- 1.6 TB of data
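(If it helps to reproduce what we are seeing, the per-node heap and parent circuit breaker figures can also be pulled with plain node stats calls; host/port below are placeholders:)
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent'
curl -s 'localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers.parent'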
Default GC config (including the flags ES sets via jvm.options):
./jdk/bin/java -Xms31232m -Xmx31232m -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 200 {product} {default}
size_t MaxNewSize = 19646119936 {product} {ergonomic}
uintx NewRatio = 2 {product} {default}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
Using the default JVM config, our cluster looks like this under heavy load:
[monitoring dashboard screenshot]
Our first attempt at tuning was to set -XX:MaxGCPauseMillis=400 -XX:NewRatio=2:
./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:NewRatio=2 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 400 {product} {command line}
size_t MaxNewSize = 10905190400 {product} {ergonomic}
uintx NewRatio = 2 {product} {command line}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
It made the heap sit constantly at its peak and old GCs run much more frequently, probably because setting NewRatio=2 shrank the young generation (which was a surprise to us; we expected it to grow). As far as we understand, with G1 an explicit NewRatio fixes the young generation at heap / (NewRatio + 1), which is smaller than the 60% ceiling the ergonomic sizing was using before, and it also overrides the adaptive young-generation sizing that the pause-time goal relies on. As a consequence, lots of circuit breaker errors and GC overhead logs.
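The numbers line up with that explanation (back-of-the-envelope arithmetic, ignoring G1 region rounding): by default G1 lets the young generation grow up to G1MaxNewSizePercent (60%) of the heap, while NewRatio=2 pins it at a third of the heap:
# heap = 31232m = 32749125632 bytes
echo $(( 32749125632 * 60 / 100 ))   # ~19.6 GB -> the default MaxNewSize above
echo $(( 32749125632 / 3 ))          # ~10.9 GB -> MaxNewSize once NewRatio=2 is set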
Then we decided to let the JVM size the young generation ergonomically, so we manually set only -XX:MaxGCPauseMillis=400:
./jdk/bin/java -Xms31232m -Xmx31232m -XX:MaxGCPauseMillis=400 -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 400 {product} {command line}
size_t MaxNewSize = 19646119936 {product} {ergonomic}
uintx NewRatio = 2 {product} {default}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
The idea was to reduce GC frequency and make each collection more efficient. It did solve the old GC frequency issue, but young GC activity became very high again, and as a consequence we were back to lots of circuit breaker errors and GC overhead logs.
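(The young/old GC frequencies we describe can be confirmed from the GC counters in node stats, e.g. with host/port again as placeholders:)
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors'
which reports collection_count and collection_time_in_millis per collector.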
Does anyone suspect anything else? Are there any other GC settings we should be taking into consideration?
Thanks in advance.