We have been having issues lately with our production cluster. All of our Grafana alerts fire with an error like this:
Error = [sse.dataQueryError] failed to execute query [A]: Post "https://elastic.prod2-obsv-eastus.domain.com:9200/_msearch?max_concurrent_shard_requests=5": context deadline exceeded
Queries against Elasticsearch stop working during this time, and we also see blocks of time in Kibana where the metrics have gaps.
Looking through the logs, I haven't found anything especially helpful so far. I did find these entries on one of our master-eligible nodes that is not currently the elected master:
{"@timestamp":"2024-06-03T15:00:54.430Z", "log.level": "WARN", "message":"health check of [/usr/share/elasticsearch/data] took [5402ms] which is above the warn threshold of [5s]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[prod2-elasticsearch-data-0][generic][T#73]","log.logger":"org.elasticsearch.monitor.fs.FsHealthService","elasticsearch.cluster.uuid":"owBcUi5vSDORDcxfpHp3xw","elasticsearch.node.id":"-jdGcGXhSzeJUhtKgIhwzQ","elasticsearch.node.name":"prod2-elasticsearch-data-0","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2024-06-03T13:54:42.658Z", "log.level": "WARN", "message":"caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/172.19.56.147:9200, remoteAddress=/172.19.0.33:41099}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[prod2-elasticsearch-data-0][transport_worker][T#4]","log.logger":"org.elasticsearch.http.AbstractHttpServerTransport","elasticsearch.cluster.uuid":"owBcUi5vSDORDcxfpHp3xw","elasticsearch.node.id":"-jdGcGXhSzeJUhtKgIhwzQ","elasticsearch.node.name":"prod2-elasticsearch-data-0","elasticsearch.cluster.name":"elasticsearch","error.type":"io.netty.handler.codec.PrematureChannelClosureException","error.message":"Channel closed while still aggregating message","error.stack_trace":"io.netty.handler.codec.PrematureChannelClosureException: Channel closed while still aggregating message\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.MessageAggregator.channelInactive(MessageAggregator.java:436)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)\n\tat io.netty.codec.http@4.1.94.Final/io.netty.handler.codec.http.HttpContentDecoder.channelInactive(HttpContentDecoder.java:235)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)\n\tat org.elasticsearch.transport.netty4@8.12.0/org.elasticsearch.http.netty4.Netty4HttpHeaderValidator.channelInactive(Netty4HttpHeaderValidator.java:186)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat 
io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)\n\tat org.elasticsearch.transport.netty4@8.12.0/org.elasticsearch.transport.netty4.Netty4WriteThrottlingHandler.channelInactive(Netty4WriteThrottlingHandler.java:109)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)\n\tat io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)\n\tat io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1085)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)\n\tat io.netty.common@4.1.94.Final/io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)\n\tat io.netty.common@4.1.94.Final/io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)\n\tat io.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)\n\tat io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)\n\tat io.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.common@4.1.94.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"}
Could either of these be related to what we are seeing? Is there anything else I could be looking for that might be causing this?
As far as I've seen, the cluster stays green the whole time.
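For the green status I'm just watching the standard cluster health endpoint, something like:

curl -s -u 'elastic:<password>' "https://elastic.prod2-obsv-eastus.domain.com:9200/_cluster/health?pretty"

so it's possible I'm missing a very brief blip between checks.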