Hey Christian, thanks for the quick reply. Looking at our configuration, we noticed that none of the GC settings were set up.
Running GET /_nodes/jvm?pretty showed the following input arguments:
"input_arguments" : [
"-Xshare:auto",
"-Des.networkaddress.cache.ttl=60",
"-Des.networkaddress.cache.negative.ttl=10",
"-XX:+AlwaysPreTouch",
"-Xss1m",
"-Djava.awt.headless=true",
"-Dfile.encoding=UTF-8",
"-Djna.nosys=true",
"-XX:-OmitStackTraceInFastThrow",
"-XX:+ShowCodeDetailsInExceptionMessages",
"-Dio.netty.noUnsafe=true",
"-Dio.netty.noKeySetOptimization=true",
"-Dio.netty.recycler.maxCapacityPerThread=0",
"-Dio.netty.allocator.numDirectArenas=0",
"-Dlog4j.shutdownHookEnabled=false",
"-Dlog4j2.disable.jmx=true",
"-Djava.locale.providers=SPI,COMPAT",
"-Xms28g",
"-Xmx28g"
]
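One way to confirm this on every node is to scan each node's input arguments for a flag that explicitly selects a collector. Below is a minimal Python sketch. It assumes the response has already been fetched and parsed, with the arguments under nodes.<node_id>.jvm.input_arguments as in the output above. The function name explicit_gc_flags and the node id in the sample body are made up for illustration.

```python
import json

# HotSpot flags that explicitly select a garbage collector.
COLLECTOR_FLAGS = {
    "-XX:+UseSerialGC",
    "-XX:+UseParallelGC",
    "-XX:+UseConcMarkSweepGC",
    "-XX:+UseG1GC",
    "-XX:+UseZGC",
    "-XX:+UseShenandoahGC",
}


def explicit_gc_flags(nodes_response: dict) -> dict[str, list[str]]:
    """Map each node id to the collector-selection flags found in its JVM input arguments."""
    result: dict[str, list[str]] = {}
    for node_id, info in nodes_response.get("nodes", {}).items():
        args = info.get("jvm", {}).get("input_arguments", [])
        result[node_id] = [arg for arg in args if arg in COLLECTOR_FLAGS]
    return result


if __name__ == "__main__":
    # Trimmed-down, hypothetical response body; "node-a" is a made-up node id.
    body = '{"nodes": {"node-a": {"jvm": {"input_arguments": ["-Xms28g", "-Xmx28g"]}}}}'
    for node, flags in explicit_gc_flags(json.loads(body)).items():
        # An empty list means no collector was chosen explicitly on the command line.
        print(node, flags or "no explicit GC flag; the JVM picks one ergonomically")
```

An empty list for a node means the collector was left to the JVM's own defaults, which is what the arguments above show.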
We went ahead and copied the default GC settings from elasticsearch/jvm.options at 7.10 · elastic/elasticsearch · GitHub.
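Worth knowing when copying that file: Elasticsearch's jvm.options lets a line start with a JDK version range (N:, N-:, or N-M:), and as far as we can tell the 7.10 defaults use this to enable CMS on JDK 8 to 13 and G1 on JDK 14 and later. The Python sketch below is a rough re-implementation of that filtering rule for illustration only, not Elasticsearch's own parser. The function name applicable_options and the abbreviated sample lines are hypothetical.

```python
import re

# Optional JDK version prefix: "N:" (exactly N), "N-:" (N and later), "N-M:" (N through M).
_PREFIX = re.compile(r"^(\d+)(-(\d+)?)?:(.*)$")


def applicable_options(lines: list[str], jdk_major: int) -> list[str]:
    """Return the JVM options from jvm.options-style lines that apply to a JDK major version."""
    result = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are skipped
        m = _PREFIX.match(line)
        if m is None:
            result.append(line)  # no version prefix: applies to every JDK
            continue
        lower = int(m.group(1))
        if m.group(2) is None:
            upper = lower  # "N:" applies to version N only
        elif m.group(3) is None:
            upper = None  # "N-:" is open-ended
        else:
            upper = int(m.group(3))  # "N-M:" is an inclusive range
        if jdk_major >= lower and (upper is None or jdk_major <= upper):
            result.append(m.group(4))
    return result


if __name__ == "__main__":
    sample = [
        "## GC configuration",
        "8-13:-XX:+UseConcMarkSweepGC",
        "14-:-XX:+UseG1GC",
        "14-:-XX:G1ReservePercent=25",
        "-Xms28g",
    ]
    print(applicable_options(sample, 15))
    print(applicable_options(sample, 11))
```

With a JDK 14 or newer, only the G1 lines survive, which is consistent with the UseG1GC flags that now show up in the node info.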
Our JVM arguments are now the following, and the cluster seems to be doing much better:
"input_arguments" : [
"-Xshare:auto",
"-Des.networkaddress.cache.ttl=60",
"-Des.networkaddress.cache.negative.ttl=10",
"-XX:+AlwaysPreTouch",
"-Xss1m",
"-Djava.awt.headless=true",
"-Dfile.encoding=UTF-8",
"-Djna.nosys=true",
"-XX:-OmitStackTraceInFastThrow",
"-XX:+ShowCodeDetailsInExceptionMessages",
"-Dio.netty.noUnsafe=true",
"-Dio.netty.noKeySetOptimization=true",
"-Dio.netty.recycler.maxCapacityPerThread=0",
"-Dio.netty.allocator.numDirectArenas=0",
"-Dlog4j.shutdownHookEnabled=false",
"-Dlog4j2.disable.jmx=true",
"-Djava.locale.providers=SPI,COMPAT",
"-Xms28g",
"-Xmx28g",
"-XX:+UseG1GC",
"-XX:G1ReservePercent=25",
"-XX:InitiatingHeapOccupancyPercent=30",
"-XX:+HeapDumpOnOutOfMemoryError",
"-XX:+HeapDumpBeforeFullGC",
"-XX:HeapDumpPath=/var/log/elasticsearch",
"-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m",
"-XX:MaxDirectMemorySize=15032385536",
"-Des.path.home=/usr/share/elasticsearch",
"-Des.path.conf=/etc/elasticsearch",
"-Des.distribution.flavor=default",
"-Des.distribution.type=rpm",
"-Des.bundled_jdk=true"
]
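The numbers in the new flags can be sanity-checked with some quick arithmetic. 15032385536 bytes is exactly half of the 28g heap. G1ReservePercent=25 keeps 7 GiB of the heap as a reserve. InitiatingHeapOccupancyPercent=30 puts the initial concurrent-marking threshold at roughly 8.4 GiB (recent JDKs treat it as a starting value that G1 may adapt at runtime). MaxDirectMemorySize does not appear in the copied file as far as we can tell, so the half-heap value is presumably derived from -Xmx by Elasticsearch itself. A minimal Python sketch follows; the helper names parse_size and g1_sizes are hypothetical.

```python
import re

# Binary multipliers for the size suffixes HotSpot accepts (e.g. -Xmx28g).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_size(value: str) -> int:
    """Convert a HotSpot size string such as '28g' or '15032385536' to bytes."""
    m = re.fullmatch(r"(\d+)([kmgt]?)", value.strip().lower())
    if m is None:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


def g1_sizes(args: list[str]) -> dict[str, int]:
    """Derive the byte sizes implied by the heap, direct-memory and G1 flags."""
    values: dict[str, str] = {}
    for arg in args:
        if arg.startswith("-Xmx"):
            values["heap"] = arg[len("-Xmx"):]
        elif arg.startswith("-XX:") and "=" in arg:
            # e.g. "-XX:G1ReservePercent=25" -> ("G1ReservePercent", "25")
            name, _, val = arg[len("-XX:"):].partition("=")
            values[name] = val
    heap = parse_size(values["heap"])
    return {
        "heap": heap,
        "max_direct": parse_size(values["MaxDirectMemorySize"]),
        "g1_reserve": heap * int(values["G1ReservePercent"]) // 100,
        "ihop_initial": heap * int(values["InitiatingHeapOccupancyPercent"]) // 100,
    }


if __name__ == "__main__":
    args = ["-Xms28g", "-Xmx28g", "-XX:+UseG1GC", "-XX:G1ReservePercent=25",
            "-XX:InitiatingHeapOccupancyPercent=30",
            "-XX:MaxDirectMemorySize=15032385536"]
    sizes = g1_sizes(args)
    print(sizes["max_direct"] == sizes["heap"] // 2)  # True
    print(sizes["g1_reserve"] / 1024 ** 3)            # 7.0
```

Integer byte arithmetic is used on purpose, so the comparison with the value reported by the node is exact rather than subject to floating-point rounding.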
Based on the conversation in Garbage Collection Not Working - #5 by danielmitterdorfer, I imagine our cluster was running with a GC implicitly chosen by the JVM, which clearly wasn't working the way we needed.
Again, thanks for the help!