We have a 10-node cluster running ES 5.6.4; each node has 8 cores and 8 GB of memory. The cluster has only one index, with 10 shards and 1 replica, and each shard holds around 180 GB of data (the single oversized shard is a historical issue).
One day, the cluster hit a number of bulk rejection errors, and each node's heap usage climbed to around 80%. Even after manually triggering an old-generation GC, the memory could not be reclaimed. We took a heap dump from one of the nodes and then rolling-restarted all the nodes; the cluster recovered, and each node's memory usage stabilized at around 20%.
From the heap dump, we found that most of the memory was held by the Netty pool cache.
There are 257 Netty pool chunks; each chunk is 16 MB, for a total of around 4 GB:
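The arithmetic behind those numbers can be sketched as follows (assumptions: Netty 4.1, which ES 5.6.x bundles, defaults to pageSize = 8 KiB and maxOrder = 11, which is where the 16 MB chunk size comes from; the class name here is ours, not Netty's):

```java
// Sketch of how Netty's default pooled chunk size is derived, and how
// 257 such chunks add up to the ~4 GB we saw in the heap dump.
public class NettyChunkMath {
    // chunkSize = pageSize * 2^maxOrder (Netty 4.1 defaults: 8192, 11)
    static long chunkSize(int pageSize, int maxOrder) {
        return (long) pageSize << maxOrder;
    }

    public static void main(String[] args) {
        long chunk = chunkSize(8192, 11);   // 16 MiB per chunk
        long total = 257L * chunk;          // 257 chunks seen in the dump
        System.out.println(chunk / (1024 * 1024));           // 16 (MiB)
        System.out.println(total / (1024L * 1024 * 1024));   // 4 (~4 GiB)
    }
}
```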
Here is the GC root path of the byte[] arrays, keeping only strong references:
Why can't the bulk thread's thread-local buffer cache be released? Any ideas about the huge memory consumption?
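For reference, these are the Netty allocator knobs that seem relevant to the per-thread cache behavior in question (a sketch only; the property names are taken from Netty 4.1, which we believe ES 5.6.x bundles, and should be verified against the exact Netty version in the build):

```
# jvm.options sketch -- not a recommendation, just the knobs involved

# Only let Netty's own event-loop threads keep a PoolThreadCache,
# so caches are not pinned by long-lived pool threads such as bulk:
-Dio.netty.allocator.useCacheForAllThreads=false

# Or bypass the pooled allocator entirely:
-Dio.netty.allocator.type=unpooled
```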