Failed to perform indices:data/write/bulk[s] on replica because of Netty4TcpChannel / CompositeBytesReference more than 2GB

While indexing into our cluster, this error sometimes occurs and turns the cluster yellow or red:

One node tries to "perform indices:data/write/bulk[s] on replica" on another node but fails with "exception caught on transport layer [Netty4TcpChannel...closing connection java.lang.IllegalArgumentException: CompositeBytesReference cannot hold more than 2GB".

We did not set any specific "network.tcp.send_buffer_size" or "network.tcp.receive_buffer_size" and the system values are:
sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.udp_rmem_min = 4096

Any ideas? Google searches for this problem only turned up source-code results...

Full error on the sending/writing node:

[WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210854}{false}{true}{false}] of size [669561038] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9023ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,800][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210881}{false}{true}{false}] of size [690408] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9014ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210885}{false}{true}{false}] of size [2319] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9014ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210888}{false}{true}{false}] of size [343027] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9011ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210892}{false}{true}{false}] of size [2248] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9010ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210896}{false}{true}{false}] of size [2416] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9004ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210899}{false}{true}{false}] of size [2776] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8754ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210906}{false}{true}{false}] of size [3078] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8200ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210909}{false}{true}{false}] of size [3647] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8200ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,809][INFO ][o.e.t.ClusterConnectionManager] [cluster_II_node_6] transport connection to [{cluster_II_node_2}{L6USlqkNSAKjUKJ_iNC3yA}{Sd-dtRctQBWKfn2OdHvadQ}{cluster_II_node_2}{10.10.1.31}{10.10.1.31:9300}{d}{8.11.1}{7000099-8500003}] closed by remote
[2023-12-21T12:47:12,810][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210912}{false}{true}{false}] of size [19389305] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8037ms] which is above the warn threshold of [5000ms] with success [false]
[2023-12-21T12:47:18,705][WARN ][o.e.a.b.TransportShardBulkAction] [cluster_II_node_6] [[cluster_node_2023_12_20_16_46_55][44]] failed to perform indices:data/write/bulk[s] on replica [cluster_node_2023_12_20_16_46_55][44], node[L6USlqkNSAKjUKJ_iNC3yA], [R], s[STARTED], a[id=a5hzS6-1TmGG4qHt3_GQxw], failed_attempts[0]
org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
	at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]
	Suppressed: org.elasticsearch.transport.NodeDisconnectedException: [cluster_II_node_2][10.10.1.31:9300][indices:data/write/bulk[s][r]] disconnected
	Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
		at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
		at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
		at java.lang.Thread.run(Thread.java:1583) ~[?:?]
	Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
		at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
		at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
		at java.lang.Thread.run(Thread.java:1583) ~[?:?]

Full error on the receiving node:

[2023-12-21T12:47:12,808][WARN ][o.e.t.TcpTransport       ] [cluster_II_node_2] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.10.1.31:9300, remoteAddress=/10.10.1.35:40702, profile=default}], closing connection
java.lang.IllegalArgumentException: CompositeBytesReference cannot hold more than 2GB
	at org.elasticsearch.common.bytes.CompositeBytesReference.ofMultiple(CompositeBytesReference.java:59) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.bytes.CompositeBytesReference.of(CompositeBytesReference.java:40) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:104) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:121) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:96) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:61) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:48) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]

This looks like another instance of Transport messages exceeding 2GiB are not handled gracefully · Issue #94137 · elastic/elasticsearch · GitHub - the workaround is to send smaller bulk requests and/or avoid ingest pipelines and other scripts which might blow up the size of the documents to be replicated.

Thanks @DavidTurner
Our bulk sizes are less than 100 MB and we don't use ingest pipelines,
but we do use scripted updates to fill nested documents inside a parent document (roughly as sketched below).
Unfortunately we need those nested documents, and the bug you referenced doesn't seem to be fixed. We are using version 8.11.1.
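
For context, a simplified sketch of what our scripted nested-document updates look like through the bulk helper - the index, field and parameter names here are placeholders, not our real mapping:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

def nested_update_actions(events):
    """Yield one scripted update per event, appending it to a nested field."""
    for event in events:
        yield {
            "_op_type": "update",
            "_index": "my-index",              # placeholder index name
            "_id": event["parent_id"],
            # Painless script that grows the parent document on every update
            "script": {
                "lang": "painless",
                "source": (
                    "if (ctx._source.events == null) { ctx._source.events = []; } "
                    "ctx._source.events.add(params.event)"
                ),
                "params": {"event": event["payload"]},
            },
            "upsert": {"events": []},          # create the parent if it does not exist yet
        }

events = [{"parent_id": "doc-1", "payload": {"ts": "2023-12-21T12:47:12Z"}}]
for ok, item in streaming_bulk(es, nested_update_actions(events)):
    if not ok:
        print("failed:", item)

Each individual update only ships a small script, but the parent document - and therefore the operation that gets replicated - keeps growing as events accumulate.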

That's still pretty large, especially if you're running scripts that blow up the document size massively. Make them smaller.

OK - we were using the defaults of the Python client:

elasticsearch-py 7.x
elasticsearch.helpers.streaming_bulk(client, actions, chunk_size=500, max_chunk_bytes=104857600, raise_on_error=True, expand_action_callback=<function expand_action>, raise_on_exception=True, max_retries=0, initial_backoff=2, max_backoff=600, yield_ok=True, ignore_status=(), *args, **kwargs)

i.e. chunk_size=500, max_chunk_bytes=104857600 (100 MiB).
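
If we simply lowered those limits, it would presumably look something like this - the values are only an illustration, not something we have validated:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

actions = (
    {"_op_type": "index", "_index": "my-index", "_source": {"field": i}}
    for i in range(1000)
)  # stand-in for our real action generator

# Smaller batches than the 500 / 100 MiB defaults, so each request (and the
# shard-level replication request derived from it) stays far below 2 GiB.
for ok, item in streaming_bulk(
    es,
    actions,
    chunk_size=100,                     # default 500
    max_chunk_bytes=10 * 1024 * 1024,   # 10 MiB, default 104857600 (100 MiB)
):
    if not ok:
        print("failed:", item)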

What would be an appropriate size?

Impossible to say without knowing how much your scripted updates expand the documents.
