We have a cluster with 60 data nodes (8 CPU cores, 28 GB memory, 14 GB reserved for the JVM heap), in addition to 3 master and 3 client nodes. There are around 35K shards (including replicas) in the cluster. In the past few weeks, we have had a couple of outages where heap usage on all data nodes rose quickly at around the same time and the nodes went OOM.
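Not claiming this is the cause, but for context, here is a rough back-of-the-envelope on our shard density (the ~20-shards-per-GB-of-heap figure is the guidance commonly quoted for Elasticsearch, not something we measured ourselves):

# Rough shard-density arithmetic using the numbers above.
data_nodes = 60
total_shards = 35_000            # primaries + replicas
heap_gb_per_node = 14

shards_per_node = total_shards / data_nodes               # ~583
shards_per_gb_heap = shards_per_node / heap_gb_per_node   # ~42

print(f"shards per data node:  {shards_per_node:.0f}")
print(f"shards per GB of heap: {shards_per_gb_heap:.1f}")
# The commonly quoted guidance is to stay below ~20 shards per GB of heap,
# so we are roughly 2x over that.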
Configuration details:
Output of java -version:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (Zulu 8.38.0.13-win64)-xxx
OpenJDK 64-Bit Server VM (Zulu 8.38.0.13-win64)-xxx
We're using G1GC instead of CMS GC because we found that it handles large heaps better. Below is the content of the jvm.options file on all data nodes:
8:-XX:+PrintGCApplicationStoppedTime
8:-XX:+PrintGCDateStamps
8:-XX:+PrintGCDetails
8:-XX:+PrintTenuringDistribution
9-:-Djava.locale.providers=COMPAT
9-:-Xlog:gc*,gc+age=trace,safepoint:file=gc.log:utctime,pid,tags:filecount=10,filesize=100m
-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djava.io.tmpdir=tmp
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Xloggc:gc.log
-Xms14336m
-Xmx14336m
-Xss1m
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCTimeStamps
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:GCLogFileSize=100M
-XX:NumberOfGCLogFiles=10
-XX:-OmitStackTraceInFastThrow
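In case it helps, this is roughly how we double-check the JVM arguments and heap size each node actually started with (a minimal sketch, assuming the node's HTTP port is reachable on localhost:9200 without authentication and that the node info response contains the usual jvm section):

# Sketch: list the JVM arguments and heap size reported by the local node.
# Assumes http://localhost:9200 is reachable with no authentication.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/_nodes/_local/jvm") as resp:
    info = json.load(resp)

for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    print(node["name"], jvm.get("vm_name"), jvm.get("vm_version"))
    print("  heap max (bytes):", jvm["mem"]["heap_max_in_bytes"])
    for arg in jvm.get("input_arguments", []):
        print("   ", arg)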
Issue:
Coming back to the incident, the graph below shows heap usage (%) of all data nodes on 19th Dec, when we had an outage. It's interesting how all data nodes followed the same pattern: they all leaked memory at the same time.
We collected a heap dump from one of the data nodes and analyzed it. I am pasting some screenshots of the analyzed report below:
This shows that most of the heap memory was consumed by org.elasticsearch.transport.TransportService. From the "Current object set" view of the analyzed data, I found that there is only one object of this class in memory. If we drill down further, we see the following:
The array of type java.util.concurrent.ConcurrentHashMap$Node[] has around 35K elements, so I guess each element represents a shard. Each of these objects consumes more than 100 MB of memory! Is this expected? Drilling down inside the 237 MB object, we see the following: the highlighted objects represent an array of org.elasticsearch.action.admin.cluster.node.stats.NodeStats objects. The number of elements in this array equals the number of data nodes. I am curious why this shard object holds three copies of node stats data. Not all shard objects show this pattern, though.
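To put rough numbers on that 237 MB entry (back-of-the-envelope only, and assuming the three NodeStats[] copies account for most of its retained size):

# Back-of-the-envelope on the 237 MB map entry from the heap dump.
entry_mb = 237           # retained size of the drilled-down entry
nodestats_copies = 3     # NodeStats[] copies seen inside it
data_nodes = 60          # elements per NodeStats[] array
heap_mb = 14336          # configured heap per data node (-Xmx14336m)

mb_per_copy = entry_mb / nodestats_copies       # ~79 MB per NodeStats[] copy
mb_per_element = mb_per_copy / data_nodes       # ~1.3 MB per NodeStats object
entries_to_fill_heap = heap_mb / entry_mb       # ~60 such entries fill the heap

print(f"~{mb_per_copy:.0f} MB per NodeStats[] copy, "
      f"~{mb_per_element:.1f} MB per NodeStats element, "
      f"~{entries_to_fill_heap:.0f} entries of this size would fill the heap")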
Note that we have installed monitoring agents on each data node that execute GET _nodes/_local/stats/http,indices,jvm,transport every minute and push the results to a monitoring channel. This is not a new setup, though; we've been running it for a long time.
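Roughly, each agent does something like the following once a minute (a simplified sketch; the real agent and the "push to monitoring channel" part differ, and push_to_monitoring_channel below is just a hypothetical placeholder):

# Simplified sketch of the per-node monitoring agent.
import json
import time
import urllib.request

STATS_URL = "http://localhost:9200/_nodes/_local/stats/http,indices,jvm,transport"

def collect_once():
    with urllib.request.urlopen(STATS_URL) as resp:
        return json.load(resp)

while True:
    stats = collect_once()
    # push_to_monitoring_channel(stats)  # hypothetical sink, omitted here
    print("collected stats for", len(stats["nodes"]), "node(s)")
    time.sleep(60)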
I understand that Elasticsearch 6.2.4 is past its EOL; we are planning to upgrade to 6.8.5 soon. Until then, I want to know whether we're hitting a known issue here. If so, which version of Elasticsearch has a fix for it? And if not, does this bug need fixing? Let me know if you need more details.