Failed to perform indices:data/write/bulk[s] on replica because of Netty4TcpChannel / CompositeBytesReference more than 2GB

While indexing into our cluster, this error sometimes occurs and turns the cluster yellow or red:

One node tries to "perform indices:data/write/bulk[s] on replica" on another node but fails with "exception caught on transport layer [Netty4TcpChannel...closing connection java.lang.IllegalArgumentException: CompositeBytesReference cannot hold more than 2GB".

We did not set any specific "network.tcp.send_buffer_size" or "network.tcp.receive_buffer_size" and the system values are:
sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.udp_rmem_min = 4096

Any ideas? Google searches for this problem only turned up source-code results...

Full error on the sending/writing node:

[WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210854}{false}{true}{false}] of size [669561038] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9023ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,800][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210881}{false}{true}{false}] of size [690408] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9014ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210885}{false}{true}{false}] of size [2319] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9014ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210888}{false}{true}{false}] of size [343027] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9011ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210892}{false}{true}{false}] of size [2248] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9010ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210896}{false}{true}{false}] of size [2416] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [9004ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210899}{false}{true}{false}] of size [2776] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8754ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210906}{false}{true}{false}] of size [3078] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8200ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,801][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210909}{false}{true}{false}] of size [3647] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8200ms] which is above the warn threshold of [5000ms] with success [true]
[2023-12-21T12:47:12,809][INFO ][o.e.t.ClusterConnectionManager] [cluster_II_node_6] transport connection to [{cluster_II_node_2}{L6USlqkNSAKjUKJ_iNC3yA}{Sd-dtRctQBWKfn2OdHvadQ}{cluster_II_node_2}{10.10.1.31}{10.10.1.31:9300}{d}{8.11.1}{7000099-8500003}] closed by remote
[2023-12-21T12:47:12,810][WARN ][o.e.t.OutboundHandler    ] [cluster_II_node_6] sending transport message [Request{indices:data/write/bulk[s][r]}{6210912}{false}{true}{false}] of size [19389305] on [Netty4TcpChannel{localAddress=/10.10.1.35:40702, remoteAddress=10.10.1.31/10.10.1.31:9300, profile=default}] took [8037ms] which is above the warn threshold of [5000ms] with success [false]
[2023-12-21T12:47:18,705][WARN ][o.e.a.b.TransportShardBulkAction] [cluster_II_node_6] [[cluster_node_2023_12_20_16_46_55][44]] failed to perform indices:data/write/bulk[s] on replica [cluster_node_2023_12_20_16_46_55][44], node[L6USlqkNSAKjUKJ_iNC3yA], [R], s[STARTED], a[id=a5hzS6-1TmGG4qHt3_GQxw], failed_attempts[0]
org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
	at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]
	Suppressed: org.elasticsearch.transport.NodeDisconnectedException: [cluster_II_node_2][10.10.1.31:9300][indices:data/write/bulk[s][r]] disconnected
	Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
		at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
		at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
		at java.lang.Thread.run(Thread.java:1583) ~[?:?]
	Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [cluster_II_node_2][10.10.1.31:9300] Node not connected
		at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:869) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:764) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:750) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1272) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:303) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:111) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.11.1.jar:?]
		at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:481) ~[elasticsearch-8.11.1.jar:?]
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
		at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
		at java.lang.Thread.run(Thread.java:1583) ~[?:?]

Full error on the receiving node:

[2023-12-21T12:47:12,808][WARN ][o.e.t.TcpTransport       ] [cluster_II_node_2] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.10.1.31:9300, remoteAddress=/10.10.1.35:40702, profile=default}], closing connection
java.lang.IllegalArgumentException: CompositeBytesReference cannot hold more than 2GB
	at org.elasticsearch.common.bytes.CompositeBytesReference.ofMultiple(CompositeBytesReference.java:59) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.common.bytes.CompositeBytesReference.of(CompositeBytesReference.java:40) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:104) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:121) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:96) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:61) ~[elasticsearch-8.11.1.jar:?]
	at org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:48) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]

This looks like another instance of Transport messages exceeding 2GiB are not handled gracefully · Issue #94137 · elastic/elasticsearch · GitHub - the workaround is to send smaller bulk requests and/or avoid ingest pipelines and other scripts which might blow up the size of the documents to be replicated.

Thanks @DavidTurner
Our bulk sizes are less than 100 MB and we don't use ingest pipelines,
but we do use scripted updates to fill nested documents inside a parent document (roughly as sketched below).
Unfortunately we need those nested documents, and the bug you referenced doesn't seem to be fixed. We are using version 8.11.1.
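
For context, a simplified sketch of what our scripted nested-document updates look like through the bulk helper - the index, field and parameter names here are placeholders, not our real mapping:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

def nested_update_actions(events):
    """Yield one scripted update per event, appending it to a nested field."""
    for event in events:
        yield {
            "_op_type": "update",
            "_index": "my-index",              # placeholder index name
            "_id": event["parent_id"],
            # Painless script that grows the parent document on every update
            "script": {
                "lang": "painless",
                "source": (
                    "if (ctx._source.events == null) { ctx._source.events = []; } "
                    "ctx._source.events.add(params.event)"
                ),
                "params": {"event": event["payload"]},
            },
            "upsert": {"events": []},          # create the parent if it does not exist yet
        }

events = [{"parent_id": "doc-1", "payload": {"ts": "2023-12-21T12:47:12Z"}}]
for ok, item in streaming_bulk(es, nested_update_actions(events)):
    if not ok:
        print("failed:", item)

Each individual update only ships a small script, but the parent document - and therefore the operation that gets replicated - keeps growing as events accumulate.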

That's still pretty large, especially if you're running scripts that blow up the document size massively. Make them smaller.

OK - we were using the defaults of the Python client:

elasticsearch-py 7.x
elasticsearch.helpers.streaming_bulk(client, actions, chunk_size=500, max_chunk_bytes=104857600, raise_on_error=True, expand_action_callback=<function expand_action>, raise_on_exception=True, max_retries=0, initial_backoff=2, max_backoff=600, yield_ok=True, ignore_status=(), *args, **kwargs)

i.e. chunk_size=500, max_chunk_bytes=104857600 (100 MiB).
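
If we simply lowered those limits, it would presumably look something like this - the values are only an illustration, not something we have validated:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

actions = (
    {"_op_type": "index", "_index": "my-index", "_source": {"field": i}}
    for i in range(1000)
)  # stand-in for our real action generator

# Smaller batches than the 500 / 100 MiB defaults, so each request (and the
# shard-level replication request derived from it) stays far below 2 GiB.
for ok, item in streaming_bulk(
    es,
    actions,
    chunk_size=100,                     # default 500
    max_chunk_bytes=10 * 1024 * 1024,   # 10 MiB, default 104857600 (100 MiB)
):
    if not ok:
        print("failed:", item)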

What would be an appropriate size?

Impossible to say without knowing how much your scripted updates expand the documents.
