Elasticsearch nodes run out of memory when ~100 fluentd instances try to contact them

Hi,
I am using Elasticsearch 7.8.0 in a k8s environment, with 20 ES ingest nodes serving traffic, 25 data nodes, and 3 master nodes. Around 100 fluentd instances push data to ES (the data volume is not uniform; some days it can reach 250 GB). Is there a limit on the number of clients that can connect to ES? Even though the pods have plenty of memory, the ES data and ingest pods keep running out of memory and getting restarted.
Below is the memory configuration for each of them.

ingest pod:
  pod limit, request: 30Gb
  cpu: 1
  -Xms15g -Xmx15g

data pod:
  pod limit, request: 30Gb
  cpu: 1
  -Xms15g -Xmx15g

master pod:
  pod limit, request: 2Gb
  cpu: 1
  -Xms1g -Xmx1g
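
For reference, actual heap and RAM usage on the pods can be checked with something like this (the pod name is illustrative, curl is assumed to be available in the image, and the URL/auth flags need adjusting for the Search Guard TLS setup):

# illustrative pod name; add the scheme/credentials your Search Guard setup requires
kubectl exec ingest-0 -- curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent"
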
openjdk version "11.0.7" 2020-04-14 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.7+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.7+10-LTS, mixed mode)
CentOS Linux release 7.8.2003 (Core)

jvm.options:

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30

## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
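
In case it helps, the GC log configured above can be pulled from a pod to check whether the heap was under sustained pressure before the OOM (pod name taken from the error log below; adjust as needed):

kubectl exec data-12 -- tail -n 100 /var/log/elasticsearch/gc.log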

Below are the logs.

{"type":"log","host":"data-12","level":"ERROR","time":"2021-05-04T13:20:15.589Z","logger":"c.f.s.s.t.SearchGuardSSLNettyTransport","timezone":"UTC","marker":"[data-12] ","log":{"message":"Exception during establishing a SSL connection: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space"}}
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:84) [transport-netty4-client-7.8.0.jar:7.8.0]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.logging.LoggingHandler.exceptionCaught(LoggingHandler.java:205) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.ssl.SslHandler.exceptionCaught(SslHandler.java:1136) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:381) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:281) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:662) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:672) ~[?:?]
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:247) ~[?:?]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:227) ~[?:?]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:147) ~[?:?]
        at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:339) ~[?:?]
        at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:168) ~[?:?]
        at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:159) ~[?:?]
        at org.elasticsearch.transport.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:137) ~[?:?]
        at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:312) ~[?:?]
        at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2199) ~[?:?]
        at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:840) ~[?:?]
        at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:811) ~[?:?]
        at io.netty.handler.ssl.SslHandler.handleUnwrapThrowable(SslHandler.java:1297) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1273) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]

I suggest upgrading to take advantage of the memory-based indexing pressure mechanism introduced in 7.9. 100 clients sending 250 GB/day to 25 data nodes should not be a big deal.
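
As a sketch, assuming you upgrade to 7.9 or later: indexing pressure is enabled by default and capped at 10% of the heap via the static node setting indexing_pressure.memory.limit in elasticsearch.yml, and per-node usage and rejections can then be inspected with the node stats API (adjust URL/auth for your Search Guard TLS setup):

# requires 7.9+; add the scheme/credentials your Search Guard setup requires
curl -s "localhost:9200/_nodes/stats/indexing_pressure?pretty"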

If you're still seeing problems after upgrading, it would be good to see the heap dump from the OOM event.
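
Since -XX:+HeapDumpOnOutOfMemoryError is already set, the dump should land in the configured -XX:HeapDumpPath. A rough way to pull it off a pod, assuming that path is on a volume that survives the restart (pod and file names are illustrative):

# list dumps, then copy one out; names are illustrative
kubectl exec data-12 -- ls /var/lib/elasticsearch
kubectl cp data-12:/var/lib/elasticsearch/java_pid1.hprof ./data-12.hprof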
