Transport errors between elasticsearch nodes

Hi!

I am creating this topic to seek help with a major issue on our Elasticsearch cluster.
We have a cluster with nearly 150 nodes (quite a bit :wink:).

We sometimes encounter a big issue: some nodes start spamming errors in their logs, such as:

{"@timestamp":"2023-06-22T12:06:32.511Z", "log.level": "WARN", "message":"sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{387009}{false}{false}{false}] of size [420] on [Netty4TcpChannel{localAddress=/, remoteAddress=siceventsindex55/, profile=default}] took [5066ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","":"ES_ECS","event.dataset":"elasticsearch.server","":"elasticsearch[siceventsindex58][transport_worker][T#20]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","":"xaqrb8Q2T7ei9aVxzfhgVg","":"siceventsindex58","":"siceventsindex"}

{"@timestamp":"2023-06-22T12:00:49.723Z", "log.level": "WARN", "message":"sending transport message [Response{774866237}{false}{true}{false}{class org.elasticsearch.action.bulk.BulkShardResponse}] of size [728] on [Netty4TcpChannel{localAddress=/, remoteAddress=/, profile=default}] took [12108ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","":"ES_ECS","event.dataset":"elasticsearch.server","":"elasticsearch[siceventsindex58][transport_worker][T#17]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","":"xaqrb8Q2T7ei9aVxzfhgVg","":"siceventsindex58","":"siceventsindex"}

The class involved is not always the same. When this issue occurs, indexing performance drops and we need to restart some nodes to "fix" it.
We have tried, without success, to find the root cause of these logs/issues.

Thanks for your time and help :slight_smile:

What version are you using?

These messages either indicate a blocked network or a blocked IO thread. See these docs for more information.

If it's a blocked IO thread then you will find the culprit by taking a stack dump using jstack just before the log message happens. In practice that means you need to run jstack every few seconds until the problem happens.
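A minimal sketch of such a capture loop. `jstack` ships with the JDK; the PID argument, output directory, retention count, and interval here are my own choices, not an official recipe:

```shell
#!/usr/bin/env bash
# Take a thread dump every few seconds so at least one lands
# just before the slow-transport warning is logged.
ES_PID="$1"                        # Elasticsearch process id, first argument
OUT_DIR="${2:-/tmp/jstack-dumps}"  # where to keep the dumps
mkdir -p "$OUT_DIR"

while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  jstack "$ES_PID" > "$OUT_DIR/jstack-$ts.txt" 2>&1
  # Keep only the 100 most recent dumps so disk usage stays bounded.
  ls -1t "$OUT_DIR"/jstack-*.txt | tail -n +101 | xargs -r rm -f
  sleep 5
done
```

Once the warning appears, correlate the log timestamp with the nearest dump file and look at what the `transport_worker` threads were doing.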

We are currently using Elasticsearch 8.4.3, and we already had these issues on Elasticsearch 7.x.

Thank you for the documentation.
I will try to capture the dump with jstack and come back with more information.

Thank you!


I took a jstack dump just before some errors occurred.

I don't really know how to analyze it.
I uploaded it to a pastebin: id j8MaS8ay (password: JLR7td2y6b).

Thanks for your help!

This looks to me like "RestGetRollupCapsAction and RestGetRollupIndexCapsAction invoke expensive GetRollupCapsAction on transport threads" (Issue #92179 · elastic/elasticsearch · GitHub).

Oh, thank you!

Do you know if there could be a workaround to avoid this issue while waiting for a fix?

I don't see an easy workaround. The best I can think of would be to run an extra coordinating-only node to which you send all GET /_rollup/data/{id} and GET /{index}/_rollup/data requests. That way the harm that these requests might do is isolated to that single node and won't affect your indexing or searches.
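For reference, a coordinating-only node is simply one whose `elasticsearch.yml` assigns an empty roles list; a minimal sketch (the node name is a placeholder, and the cluster name is taken from the logs above):

```yaml
# elasticsearch.yml for a coordinating-only node:
# an empty node.roles list means no master, data, or ingest duties,
# so the node only routes and coordinates requests.
cluster.name: siceventsindex
node.name: coord-rollup-1   # placeholder name
node.roles: []
```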

Thanks for your help!

We didn't even know there was a rollup job in our cluster.
It was created for a test a few months back and never used...
We did some cleaning :wink:

I will keep you posted on whether it solves our issues... or not :slight_smile:

Hi David!

After removing all rollup indices from our cluster, we still see the errors, with rollup requests in the stack trace.
Do you have any idea about the origin of these requests?

Thank you for your time!


We identified the source of these requests as internal work done by Kibana.
We tried to disable all rollup functionality in Kibana using this parameter:

xpack.rollup.ui.enabled: false

It disabled the UI part but not the internal requests.

We finally put a proxy between Kibana and Elasticsearch which returns a 404 for any request with "rollup" in its path.
This fixed our issue, and since then we have had no incidents due to transport errors.
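For anyone hitting the same problem, the proxy rule can be a single location block; a sketch assuming nginx sits between Kibana and Elasticsearch (listen port and upstream address are placeholders):

```nginx
# Reject any request whose path contains "_rollup";
# proxy everything else through to Elasticsearch unchanged.
server {
  listen 9201;                                # placeholder port Kibana points at
  location ~ _rollup {
    return 404;
  }
  location / {
    proxy_pass http://siceventsindex58:9200;  # placeholder upstream node
  }
}
```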

Nice idea, thanks for closing the loop. The issue I linked above is now closed, with the fix expected to land in 8.10.0, so once you upgrade you should be able to remove your proxy workaround.

It might be worth raising this with the Kibana folks too; I would hope they can add a way to turn off these requests at the source.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.