We have 10 nodes running ES 5.6.4, each with 8 cores and 8 GB of memory. The cluster has only one index, with 10 shards and 1 replica, and each shard holds around 180 GB of data (the single big shard is a historical issue).
One day the cluster hit a few bulk reject errors and each node's heap usage climbed to around 80%. Even after manually triggering an old GC, the memory could not be reclaimed. We took a heap dump from one of the nodes and then rolling-restarted all the nodes; the cluster recovered and each node's memory usage stabilized at around 20%.
From the heap dump, we found that most of the memory was used by the netty pool cache.
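(For reference, the GC trigger and the heap dump can be done with the standard JDK tools plus the cat nodes API; the PID and output path below are placeholders, not our actual values.)

# check heap and CPU per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu'
# force a full GC on the suspect node
jcmd <es-pid> GC.run
# capture a heap dump of live objects for offline analysis
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <es-pid>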
I am not the right person to look at this, but I suspect it would help if you could provide your Elasticsearch config as well as any non-default JVM settings you are using.
What type of load is the cluster under? What is the use-case?
And apart from Xms/Xmx, we don't have any non-default JVM settings:
-Xms7896m
-Xmx7896m
Here is the average JVM heap usage for each node during that time; each node's CPU utilization was around 20%. After rolling-restarting the nodes, JVM usage came down.
The cluster has a single index with 10 shards and 1 replica:
green open lambda 68_xZphGTG6aGhXhQsjcYw 10 1 1754643122 2859 1.8tb 1.8tb
And we could see some bulk rejections, and node memory could not be reclaimed by old GC; as described above, it was used up by netty pool buffers.
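(The bulk rejections can be confirmed per node with the standard cat thread pool API; this is the stock 5.x endpoint, nothing specific to our setup.)

curl -s 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'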
Thank you @Christian_Dahlqvist. The average document size is 5 KB and the average bulk size is 5000 documents; the target index has 10 primary shards, each with one replica, so 20 shards in total.
Currently the cluster runs fine without exceptions. I also took a heap dump of the current state for analysis and found fewer than 10 byte arrays containing netty pool chunks.
But in the earlier problem case, the dump from one node had around 500 netty pool chunk byte arrays, which used up more than 4 GB of heap (out of an 8 GB total heap).
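At 5 KB per document and 5000 documents per bulk, each bulk request is on the order of 25 MB, which netty has to buffer on the receiving nodes, so large pooled buffers during heavy indexing are not surprising. If the pooled allocator turns out to be the culprit, the usual knobs are netty's standard system properties, which can be set in jvm.options; this is only a sketch, and we have not validated any of these in production:

# disable netty's per-thread object recycler
-Dio.netty.recycler.maxCapacityPerThread=0
# do not use pooled direct arenas
-Dio.netty.allocator.numDirectArenas=0
# or switch away from the pooled allocator entirely
-Dio.netty.allocator.type=unpooled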