I am creating this topic to seek help with a major issue on our Elasticsearch cluster.
We have a cluster with nearly 150 nodes (quite a lot).
We sometimes encounter a serious issue: some nodes start spamming errors in the logs, like:
{"@timestamp":"2023-06-22T12:06:32.511Z", "log.level": "WARN", "message":"sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{387009}{false}{false}{false}] of size [420] on [Netty4TcpChannel{localAddress=/172.17.0.2:47418, remoteAddress=siceventsindex55/10.0.2.75:9300, profile=default}] took [5066ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[siceventsindex58][transport_worker][T#20]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","elasticsearch.node.id":"xaqrb8Q2T7ei9aVxzfhgVg","elasticsearch.node.name":"siceventsindex58","elasticsearch.cluster.name":"siceventsindex"}
{"@timestamp":"2023-06-22T12:00:49.723Z", "log.level": "WARN", "message":"sending transport message [Response{774866237}{false}{true}{false}{class org.elasticsearch.action.bulk.BulkShardResponse}] of size [728] on [Netty4TcpChannel{localAddress=/172.17.0.2:9300, remoteAddress=/10.0.2.104:37928, profile=default}] took [12108ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[siceventsindex58][transport_worker][T#17]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","elasticsearch.node.id":"xaqrb8Q2T7ei9aVxzfhgVg","elasticsearch.node.name":"siceventsindex58","elasticsearch.cluster.name":"siceventsindex"}
The class involved is not always the same. When this issue occurs, indexing performance drops and we need to restart some nodes to "fix" the issue.
We have tried, without success, to find the root cause of these logs / issues.
These messages indicate either a blocked network or a blocked IO thread. See these docs for more information.
If it's a blocked IO thread then you will find the culprit by taking a thread dump with jstack just before the log message happens. In practice that means you need to run jstack every few seconds until the problem occurs.
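For example, a simple capture loop might look like this (the process pattern and output directory are assumptions; adjust them for your deployment):

```shell
# Find the Elasticsearch JVM PID; the process pattern is an assumption,
# adjust it for your deployment (e.g. when running in a container).
PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
OUTDIR=/tmp/jstack-dumps
mkdir -p "$OUTDIR"

# Take a thread dump every few seconds; stop the loop (Ctrl-C) once the
# "took [...] which is above the warn threshold" message appears in the logs.
while [ -n "$PID" ]; do
  jstack "$PID" > "$OUTDIR/jstack-$(date +%s).txt"
  sleep 5
done
```

The dumps taken just before the warning appears should show what the transport_worker thread was stuck on.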
I don't see an easy workaround. The best I can think of would be to run an extra coordinating-only node to which you send all GET /_rollup/data/{id} and GET /{index}/_rollup/data requests. That way the harm that these requests might do is isolated to that single node and won't affect your indexing or searches.
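For reference, on recent Elasticsearch versions a coordinating-only node is one with an empty roles list in its elasticsearch.yml:

```yaml
# elasticsearch.yml on the dedicated coordinating node:
# an empty roles list means the node holds no data and is not
# master-eligible, so it only routes and coordinates requests.
node.roles: [ ]
```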
After we removed all rollup indices from our cluster, we still see the errors with rollup requests in the stack trace.
Do you have any idea about the origin of these requests?
We identified the source of these requests to be some internal work from Kibana.
We tried to disable all rollup functionality from Kibana using this setting:
xpack.rollup.ui.enabled: false
It disabled the UI part but not the internal requests.
We finally put a proxy between Kibana and Elasticsearch which returns a 404 for all requests with rollup in their path.
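For anyone hitting the same problem, a minimal sketch of such a proxy rule, assuming an nginx reverse proxy (the upstream hostname and port are placeholders):

```nginx
# Return 404 for any request whose path contains "rollup",
# so Kibana's internal rollup calls never reach Elasticsearch.
location ~* rollup {
    return 404;
}

# Pass everything else through to the Elasticsearch HTTP endpoint.
location / {
    proxy_pass http://elasticsearch:9200;
}
```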
This fixed our issue, and since then we have had no incidents due to transport errors.
Nice idea, thanks for closing the loop. The issue I linked above is now closed, with the fix expected to land in 8.10.0, so when you upgrade you should be able to remove your proxy workaround again.
Might be worth raising this with the Kibana folks too, I would hope they can add a way to turn off these requests at source.