I am creating this topic to seek help with a major issue on our Elasticsearch cluster.
We have a cluster with nearly 150 nodes (quite a lot).
We sometimes encounter a serious issue: some nodes start spamming errors in the logs, like:
{"@timestamp":"2023-06-22T12:06:32.511Z", "log.level": "WARN", "message":"sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{387009}{false}{false}{false}] of size [420] on [Netty4TcpChannel{localAddress=/172.17.0.2:47418, remoteAddress=siceventsindex55/10.0.2.75:9300, profile=default}] took [5066ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[siceventsindex58][transport_worker][T#20]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","elasticsearch.node.id":"xaqrb8Q2T7ei9aVxzfhgVg","elasticsearch.node.name":"siceventsindex58","elasticsearch.cluster.name":"siceventsindex"}
{"@timestamp":"2023-06-22T12:00:49.723Z", "log.level": "WARN", "message":"sending transport message [Response{774866237}{false}{true}{false}{class org.elasticsearch.action.bulk.BulkShardResponse}] of size [728] on [Netty4TcpChannel{localAddress=/172.17.0.2:9300, remoteAddress=/10.0.2.104:37928, profile=default}] took [12108ms] which is above the warn threshold of [5000ms] with success [true]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[siceventsindex58][transport_worker][T#17]","log.logger":"org.elasticsearch.transport.OutboundHandler","elasticsearch.cluster.uuid":"5wU1Sl1eRpyp8aZC_i0BHA","elasticsearch.node.id":"xaqrb8Q2T7ei9aVxzfhgVg","elasticsearch.node.name":"siceventsindex58","elasticsearch.cluster.name":"siceventsindex"}
The class involved is not always the same. When this issue occurs, indexing performance drops and we need to restart some nodes to "fix" the issue.
We have tried, without success, to find the root cause of these logs / issues.
These messages indicate either a blocked network or a blocked IO thread. See these docs for more information.
If it's a blocked IO thread then you will find the culprit by taking a thread dump with jstack just before the log message happens. In practice that means you need to run jstack every few seconds until the problem occurs.
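For example, a simple capture loop might look like this (the process pattern and output directory are assumptions; adjust them for your deployment):

```shell
# Find the Elasticsearch JVM PID; the process pattern is an assumption,
# adjust it for your deployment (e.g. when running in a container).
PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
OUTDIR=/tmp/jstack-dumps
mkdir -p "$OUTDIR"

# Take a thread dump every few seconds; stop the loop (Ctrl-C) once the
# "took [...] which is above the warn threshold" message appears in the logs.
while [ -n "$PID" ]; do
  jstack "$PID" > "$OUTDIR/jstack-$(date +%s).txt"
  sleep 5
done
```

The dumps taken just before the warning appears should show what the transport_worker thread was stuck on.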
I don't see an easy workaround. The best I can think of would be to run an extra coordinating-only node to which you send all GET /_rollup/data/{id} and GET /{index}/_rollup/data requests. That way the harm that these requests might do is isolated to that single node and won't affect your indexing or searches.
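For reference, on recent Elasticsearch versions a coordinating-only node is one with an empty roles list in its elasticsearch.yml:

```yaml
# elasticsearch.yml on the dedicated coordinating node:
# an empty roles list means the node holds no data and is not
# master-eligible, so it only routes and coordinates requests.
node.roles: [ ]
```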
After we removed all rollup indices from our cluster, we still see the errors with rollup requests in the stack trace.
Do you have any idea about the origin of these requests?
We identified the source of these requests to be some internal work from Kibana.
We tried to disable all rollup functionality from Kibana using this setting:
xpack.rollup.ui.enabled: false
It disabled the UI part but not the internal requests.
We finally put a proxy between Kibana and Elasticsearch which returns a 404 for all requests with rollup in their path.
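For anyone hitting the same problem, a minimal sketch of such a proxy rule, assuming an nginx reverse proxy (the upstream hostname and port are placeholders):

```nginx
# Return 404 for any request whose path contains "rollup",
# so Kibana's internal rollup calls never reach Elasticsearch.
location ~* rollup {
    return 404;
}

# Pass everything else through to the Elasticsearch HTTP endpoint.
location / {
    proxy_pass http://elasticsearch:9200;
}
```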
This fixed our issue, and since then we have had no incidents due to transport errors.
Nice idea, thanks for closing the loop. The issue I linked above is now closed, with the fix expected to land in 8.10.0, so when you upgrade you should be able to remove your proxy workaround again.
Might be worth raising this with the Kibana folks too, I would hope they can add a way to turn off these requests at source.