Hello, I have Elasticsearch, Fluent Bit, and Kibana installed in our Kubernetes cluster.
Lately we have been losing a large share of the logs that should flow from the Kubernetes nodes into the Elasticsearch indices and, from there, into Kibana.
I have been noticing errors similar to the ones below from Fluent Bit. They tell me that Fluent Bit is unable to push its chunks to the output, in this case the Elasticsearch cluster.
[2022/06/08 13:13:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:39] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:40] [ info] [input] systemd.1 resume (mem buf overlimit)
[2022/06/08 13:13:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:40] [ warn] [input] systemd.1 paused (mem buf overlimit)
[2022/06/08 13:13:42] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:42] [ warn] [engine] failed to flush chunk '1-1654693979.91366988.flb', retry in 8 seconds: task_id=25, input=tail.0 > output=es.0 (out_id=0)
[2022/06/08 13:13:42] [ warn] [engine] failed to flush chunk '1-1654693940.467286805.flb', retry in 79 seconds: task_id=6, input=tail.0 > output=es.0 (out_id=0)
[2022/06/08 13:13:43] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:43] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:43] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/06/08 13:13:43] [ info] [input] tail.0 resume (mem buf overlimit)
[2022/06/08 13:13:44] [ warn] [input] tail.0 paused (mem buf overlimit)
I have gone through many Fluent Bit options and tried them out (increasing the memory buffer limit, buffering on disk instead of in memory, and so on; a sketch of the kind of configuration I tried is shown after the heap setting below), but nothing has worked. Restarting the Fluent Bit pods helps for only a couple of seconds before the "tail paused" messages return, and nothing ends up in the Elasticsearch indices or, therefore, in Kibana. I also want to add that the Elasticsearch data nodes use more than 75% of their memory (each data node is allotted 4 GB) almost all of the time. The data nodes have the following heap setting:
ES_JAVA_OPTS: -Xms2g -Xmx2g
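For reference, this is roughly the kind of buffering configuration I experimented with. It is only a sketch: the values and the /var/log/flb-storage/ path are examples of what I tried, not necessarily what is deployed right now.

[SERVICE]
    # buffer chunks on the filesystem so backpressure does not pause the inputs
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 50M

[INPUT]
    Name           tail
    Path           /var/log/containers/*.log
    Mem_Buf_Limit  50M          # raised from the 5M default in one attempt
    storage.type   filesystem   # spill to disk instead of holding chunks in memory

Even with variations of this, nothing improved, which is why I suspect the real bottleneck is on the Elasticsearch side.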
Turning to the Elasticsearch master and data nodes, errors like the one below are logged periodically. What is odd is that a small fraction of the flb chunks (maybe 10%) does make it into the indices, which makes me think this is not actually an SSL certificate issue. It could be something else, but I am unable to make progress in finding the cause of these errors from Elasticsearch. My belief is that because the Elasticsearch nodes cannot process the messages from the various Fluent Bit pods due to this error, Fluent Bit just pauses indefinitely. (A basic connectivity check I could run is sketched after the stack trace below.)
2022-06-08T13:11:39.570067756Z {"type": "server", "timestamp": "2022-06-08T13:11:39,568Z", "level": "WARN", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "elastic-cluster", "node.name": "elastic-cluster-es-master-2", "message": "caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/10.233.118.18:9200, remoteAddress=/10.233.89.197:35874}", "cluster.uuid": "EQQ472KzSX6QcuQCs3jRuw", "node.id": "mmrOUNzpSMy8RyDP_wLHxg" ,
2022-06-08T13:11:39.570103748Z "stacktrace": ["io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?",
2022-06-08T13:11:39.570112079Z "at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:473) ~[netty-codec-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570127358Z "at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:281) ~[netty-codec-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570133826Z "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570139755Z "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570145286Z "at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570150766Z "at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570156493Z "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570162131Z "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570167811Z "at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570175198Z "at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570181085Z "at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570186779Z "at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:600) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570192321Z "at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:554) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570242831Z "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570264241Z "at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570270499Z "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570276101Z "at java.lang.Thread.run(Thread.java:830) [?:?]",
2022-06-08T13:11:39.570281353Z "Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?",
2022-06-08T13:11:39.570286653Z "at sun.security.ssl.SSLEngineInputRecord.bytesInCompletePacket(SSLEngineInputRecord.java:146) ~[?:?]",
2022-06-08T13:11:39.570292209Z "at sun.security.ssl.SSLEngineInputRecord.bytesInCompletePacket(SSLEngineInputRecord.java:64) ~[?:?]",
2022-06-08T13:11:39.570297813Z "at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:605) ~[?:?]",
2022-06-08T13:11:39.570302918Z "at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:499) ~[?:?]",
2022-06-08T13:11:39.570312135Z "at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:475) ~[?:?]",
2022-06-08T13:11:39.570317811Z "at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634) ~[?:?]",
2022-06-08T13:11:39.570323185Z "at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:280) ~[netty-handler-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570328785Z "at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1332) ~[netty-handler-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570334831Z "at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) ~[netty-handler-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570340515Z "at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) ~[netty-handler-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570354715Z "at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:503) ~[netty-codec-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570361873Z "at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:442) ~[netty-codec-4.1.43.Final.jar:4.1.43.Final]",
2022-06-08T13:11:39.570367756Z "... 16 more"] }
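The "plaintext connection?" part of the exception suggests that something is reaching port 9200 over plain HTTP while the node expects TLS. A basic check I could run from a pod inside the cluster would be to compare a plain HTTP request with an HTTPS one; the service name elastic-cluster-es-http and the <user>:<password> placeholder below are just examples for my setup, not confirmed values:

# should be rejected if the node only speaks TLS on 9200
curl -v http://elastic-cluster-es-http:9200/

# should return the cluster banner if TLS and the credentials are fine
curl -vk -u '<user>:<password>' https://elastic-cluster-es-http:9200/

If the HTTPS request succeeds, the certificate itself is not the problem, which would match the fact that some chunks do get indexed.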
This is the output section of the Fluent Bit configuration:
[OUTPUT]
    Name            es
    Match           *
    Host            ${FLUENT_ELASTICSEARCH_HOST}
    Port            ${FLUENT_ELASTICSEARCH_PORT}
    Logstash_Format On
    Replace_Dots    On
    Retry_Limit     False
    tls             On
    tls.verify      Off
    HTTP_User       <>
    HTTP_Passwd     <>
    Trace_Error     On
    net.keepalive   Off
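To get more visibility into what the es output is actually doing, one thing I could try (a sketch only; port 2020 and the /api/v1 paths are the Fluent Bit defaults, and the [SERVICE] keys would be merged into the existing service section) is to enable the built-in HTTP monitoring endpoint and watch the per-output retry and error counters:

[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

# then, from inside a Fluent Bit pod:
curl -s http://127.0.0.1:2020/api/v1/metrics
curl -s http://127.0.0.1:2020/api/v1/storage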
Can someone help me understand what is going wrong in my case? This was working well until about three months ago; there have been no upgrades on our end, and it slowly deteriorated to its current state.