Heap memory leak in Elasticsearch 6.2.4

We have a cluster with 60 data nodes (8 CPU cores, 28 GB memory, 14 GB reserved for JVM heap) in addition to 3 master and 3 client nodes. There are around 35K shards (including replicas) in the cluster. Over the past few weeks, we have had a couple of outages in which heap usage on all data nodes rose quickly at around the same time and the nodes ran out of memory (OOM).

Configuration details:
Output of java -version:

openjdk version "1.8.0_212"
OpenJDK Runtime Environment (Zulu 8.38.0.13-win64)-xxx
OpenJDK 64-Bit Server VM (Zulu 8.38.0.13-win64)-xxx

We're using G1GC instead of CMS GC because we found that it handles large heaps better. Below is the content of the jvm.options file on all data nodes.

8:-XX:+PrintGCApplicationStoppedTime
8:-XX:+PrintGCDateStamps
8:-XX:+PrintGCDetails
8:-XX:+PrintTenuringDistribution
9-:-Djava.locale.providers=COMPAT
9-:-Xlog:gc*,gc+age=trace,safepoint:file=gc.log:utctime,pid,tags:filecount=10,filesize=100m
-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djava.io.tmpdir=tmp
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Xloggc:gc.log
-Xms14336m
-Xmx14336m
-Xss1m
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCTimeStamps
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:GCLogFileSize=100M
-XX:NumberOfGCLogFiles=10
-XX:-OmitStackTraceInFastThrow
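
For anyone reading the file above, here is a rough sketch of which flags actually take effect on Java 8. It assumes the documented jvm.options prefix rules (`8:` means Java 8 only, `9-:` means Java 9 and later, `8-9:` means a range, no prefix means all versions). The function name `effective_options` is made up for this illustration, and this is not how Elasticsearch itself parses the file.

```python
import re

# Optional Elasticsearch-style version prefix: "8:", "9-:", or "8-9:".
_PREFIX = re.compile(r"^(\d+)(-)?(\d+)?:(.*)$")


def effective_options(lines, java_major):
    """Return the JVM flags that apply to the given Java major version."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _PREFIX.match(line)
        if match is None:
            selected.append(line)  # no prefix: applies to every version
            continue
        low = int(match.group(1))
        has_dash = match.group(2) is not None
        high = match.group(3)
        if not has_dash:
            applies = java_major == low          # "8:"   -> exactly Java 8
        elif high is None:
            applies = java_major >= low          # "9-:"  -> Java 9 and later
        else:
            applies = low <= java_major <= int(high)  # "8-9:" -> inclusive range
        if applies:
            selected.append(match.group(4))
    return selected


sample = [
    "8:-XX:+PrintGCDetails",
    "9-:-Djava.locale.providers=COMPAT",
    "-Xms14336m",
    "-XX:+UseG1GC",
]
print(effective_options(sample, 8))
# ['-XX:+PrintGCDetails', '-Xms14336m', '-XX:+UseG1GC']
```

On our Java 8 nodes, the `9-:` lines are therefore ignored and the unprefixed GC logging flags apply.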

Issue:
Coming back to the incident, the graph below shows % heap usage of all data nodes on 19th Dec when we had an outage. It's interesting how all data nodes followed the same pattern. They all leaked memory at the same time.
[image: % heap usage of all data nodes on 19th Dec]

We collected a heap dump from one of the data nodes and analyzed it. Some screenshots from the analysis report are pasted below:


This shows that most of the heap memory was consumed by org.elasticsearch.transport.TransportService. From the "Current object set" view of the analyzed data, I found that there is only one object of this class in memory.
If we drill down further, we see the following:

The array of type java.util.concurrent.ConcurrentHashMap$Node[] has around 35K objects, so I guess each element represents a shard. Each of these objects consumes more than 100 MB of memory! Is this expected? Drilling down inside the 237 MB object, we see the following:

The highlighted objects represent an array of org.elasticsearch.action.admin.cluster.node.stats.NodeStats objects. The number of elements in this array equals the number of data nodes. I am curious as to why this shard object has 3 copies of the node stats data. Not all shard objects have this pattern though.

Note that we have installed monitoring agents on each data node that execute GET _nodes/_local/stats/http,indices,jvm,transport every minute and push the results to a monitoring channel. But this is not a new setup. We've been running with this setup for a long time now.
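
To make the monitoring call concrete, here is a minimal sketch of what such an agent does. This is not our actual agent code. The base URL `http://localhost:9200` and the function names are assumptions for illustration. The `jvm.mem.heap_used_percent` field is part of the node stats response.

```python
import json
import urllib.request

# Same endpoint and metric groups that the agents request every minute.
STATS_PATH = "/_nodes/_local/stats/http,indices,jvm,transport"


def stats_url(base_url):
    """Build the node stats URL for the local node."""
    return base_url.rstrip("/") + STATS_PATH


def heap_used_percent(stats):
    """Map node name -> JVM heap used percent from a parsed node stats response."""
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in stats.get("nodes", {}).values()
    }


def fetch_stats(base_url, timeout=10):
    """Fetch and parse node stats; needs a reachable node, so it is not called here."""
    with urllib.request.urlopen(stats_url(base_url), timeout=timeout) as resp:
        return json.load(resp)


# Hand-written sample in the shape of a node stats response (values are made up).
sample_response = {
    "nodes": {
        "node-id-1": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 72}}}
    }
}
print(stats_url("http://localhost:9200"))
print(heap_used_percent(sample_response))
```

Each call with `_local` returns stats for the node it is sent to, so every agent only queries its own node.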

I understand that Elasticsearch 6.2.4 is past its EOL. We are soon going to upgrade to 6.8.5. Until then, I want to know if we're hitting any known issue here. If so, which version of Elasticsearch has a fix for it, and if not, does this bug need fixing? Let me know if you need more details.

[Pinging for response]

I don't know if it's a known issue or what is happening, but I did some quick maths and it looks like you have twice the number of shards we normally recommend.

We recommend at most around 20 shards per GB of heap. You are at around 40 shards per GB.
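
For reference, the quick maths looks roughly like this. It is only a sketch using the numbers from your post, and it assumes "around 35K" means 35,000 shards spread across the 60 data nodes with 14 GB of heap each. The function name is just for illustration.

```python
def shards_per_gb(total_shards, data_nodes, heap_gb_per_node):
    """Average number of shards per GB of data-node heap across the cluster."""
    return total_shards / (data_nodes * heap_gb_per_node)


RECOMMENDED_MAX_PER_GB = 20  # the guideline mentioned above

total_heap_gb = 60 * 14                      # 840 GB of heap across data nodes
ratio = shards_per_gb(35_000, 60, 14)
max_shards = RECOMMENDED_MAX_PER_GB * total_heap_gb

print(round(ratio, 1))  # 41.7
print(max_shards)       # 16800
```

So with the current heap you would be looking at roughly 16,800 shards to stay within the guideline.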

You should also consider upgrading to 6.8 or, better, to 7.5. If you go that way, you should also update the JVM version, IMHO.

But I'll let the experts say more about your findings.
