Netty LEAK logs on Elasticsearch clusters, probably after version upgrade

Hello,

We have two Kubernetes clusters running in two DCs, and on each of them we run an Elasticsearch cluster. These two Elasticsearch clusters are used by a service on our website to answer queries.

Cluster details per DC (old):
ECK-operator version: 1.12.1

Elasticsearch version: 8.12.0

Data/master combined node count: 3
CPU: 8 cores
Memory: 8 GB
JVM heap: 4 GB

Service details:
Elastic Java API client version: 8.15.5

We have two types of indices:

The first type is reindexed every hour; these indices are tiny, with around 20k documents each.
The second type is reindexed every 15 minutes, and around 10 million documents are indexed within about 3 minutes (the indexer works roughly like the bulk sketch below).
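
To give an idea of the indexing side, the bulk loading is done with the Java API client's BulkIngester helper, roughly as in the sketch below. This is a simplified sketch, not our exact code: the index name, the document class and the batching parameters are placeholders.

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._helpers.bulk.BulkIngester;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class BigIndexLoader {

    // Sketch only: "big_index_v2", MyDoc and the batching numbers are placeholders.
    static void load(ElasticsearchClient client, List<MyDoc> docs) {
        BulkIngester<Void> ingester = BulkIngester.of(b -> b
                .client(client)
                .maxOperations(5_000)                 // flush after 5k queued operations
                .maxConcurrentRequests(4)             // bulk requests allowed in flight
                .flushInterval(1, TimeUnit.SECONDS)); // or flush at least once per second

        for (MyDoc doc : docs) {
            ingester.add(op -> op.index(idx -> idx
                    .index("big_index_v2")
                    .id(doc.id())
                    .document(doc)));
        }

        ingester.close(); // flushes pending operations and waits for the responses
    }

    record MyDoc(String id, String payload) {}
}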

The reindex procedure is as follows: we create the new indices, populate the documents, move the aliases, and remove the old indices; a simplified sketch of this cycle follows below. Since the data is ephemeral, our development team chose this approach.
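
In code the cycle looks roughly like this with the Java API client (again only a sketch; "products", "products_v1" and "products_v2" are placeholder alias/index names, and mappings/settings are omitted):

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import java.io.IOException;

public class ReindexRotation {

    static void rotate(ElasticsearchClient client) throws IOException {
        // 1. Create the new index (mappings and settings omitted in this sketch).
        client.indices().create(c -> c.index("products_v2"));

        // 2. Populate it (bulk indexing as in the previous sketch).

        // 3. Move the alias from the old index to the new one in a single atomic update.
        client.indices().updateAliases(u -> u
                .actions(a -> a.remove(r -> r.index("products_v1").alias("products")))
                .actions(a -> a.add(ad -> ad.index("products_v2").alias("products"))));

        // 4. Remove the old index now that the alias points at the new one.
        client.indices().delete(d -> d.index("products_v1"));
    }
}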

On the old cluster we had some problems when indexing and heavy searching were going on at the same time, but we never had issues like this (as far as we know).

CLUSTER UPGRADE DETAILS

We recently upgraded our ECK operator and Elasticsearch versions.

Cluster details per DC (new):
ECK-operator version: 1.16.1

Elasticsearch version: 8.13.4

Data/master combined node count: 3
CPU: 8 cores
Memory: 8 GB
JVM heap: 4 GB

After the upgrade we started to see LEAK logs on the data nodes, and the second type of indices could no longer be indexed at all, so the indexer for the big indices was shut down. Unfortunately, since we only discovered this now, we don't have logs from the time of the old cluster setup to check whether LEAK logs were present there as well.

Log sample:

{"@timestamp":"2025-03-12T16:00:13.346Z", "log.level":"ERROR", "message":"LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.\nRecent access records: \nCreated at:\n\tio.netty.buffer@4.1.94.Final/io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:390)\n\tio.netty.buffer@4.1.94.Final/io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169)\n\tio.netty.buffer@4.1.94.Final/io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160)\n\torg.elasticsearch.transport.netty4@8.13.4/org.elasticsearch.transport.netty4.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:282)\n\torg.elasticsearch.transport.netty4@8.13.4/org.elasticsearch.transport.netty4.NettyAllocator$NoDirectBuffers.buffer(NettyAllocator.java:252)\n\tio.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.allocate(SslHandler.java:2267)\n\tio.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1348)\n\tio.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1246)\n\tio.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1295)\n\tio.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)\n\tio.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)\n\tio.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)\n\tio.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tio.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tio.netty.common@4.1.94.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tjava.base/java.lang.Thread.run(Thread.java:1583)", "ecs.version": 
"1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[***][transport_worker][T#2]","log.logger":"io.netty.util.ResourceLeakDetector","elasticsearch.cluster.uuid":"***","elasticsearch.node.id":"W_xLbnr4SKSe4SZeSoLLsQ","elasticsearch.node.name":"***","elasticsearch.cluster.name":"***"}

Along with this ERROR we also started to see GC spikes and OOMs on the data nodes. We caught this because Elasticsearch deployed with the ECK operator writes heap dumps into the same data directory that the nodes use, which caused disk space issues.

Has anyone experienced this problem after upgrading from 8.12.0?

Regards.