CircuitBreakingException[[parent] Data too large on upgrading to Elasticsearch 7.7 from 5.6.16

We have a 3-node Elasticsearch cluster (16 cores, 64GB RAM per node) running version 7.7.1. Each Elasticsearch instance is given a heap of 4.6GB. While indexing documents with the bulk API, we are encountering the following exception.

Caused by: RemoteTransportException[[platform1][127.0.0.1:9300][indices:data/write/bulk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4506197902/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4488122200/4.1gb], new bytes reserved: [18075702/17.2mb], usages [request=147960/144.4kb, fielddata=13357/13kb, in_flight_requests=2706206662/2.5gb, accounting=35746016/34mb]];
Caused by: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4506197902/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4488122200/4.1gb], new bytes reserved: [18075702/17.2mb], usages [request=147960/144.4kb, fielddata=13357/13kb, in_flight_requests=2706206662/2.5gb, accounting=35746016/34mb]]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:347)
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128)
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:171)
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:119)
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:103)
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:676)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)

Some of the shards are in the UNASSIGNED state and do not move to STARTED until a manual reroute is triggered. Other shards are in the STARTED state but not healthy; see shard 1 of the denorm index in the output below.

ubuntu@platform3:~$ curl http://localhost:9200/_cat/shards
testindex 4  p STARTED      286993   3.1gb 10.79.198.111 platform2
testindex 4  r UNASSIGNED   286381   3.1gb 10.79.198.97  platform3
testindex 3  r STARTED      286802     3gb 10.79.196.190 platform1
testindex 3  p STARTED      286408   3.5gb 10.79.198.97  platform3
testindex 13 p STARTED      287826   2.9gb 10.79.198.111 platform2
testindex 13 r STARTED      287570   3.1gb 10.79.198.97  platform3
testindex 15 r INITIALIZING                10.79.198.111 platform2
testindex 15 p STARTED      286536     3gb 10.79.198.97  platform3
testindex 18 r INITIALIZING                10.79.198.111 platform2
testindex 18 p STARTED      287414   3.1gb 10.79.198.97  platform3

denorm  17 p STARTED 8764463  17.7gb 10.62.70.173 platform1
denorm  17 r STARTED 8764463  17.6gb 10.62.70.174 platform2
denorm  14 r STARTED 8847590  18.6gb 10.62.70.173 platform1
denorm  14 p STARTED 8847590  18.6gb 10.62.70.174 platform2
denorm  1  r STARTED 8902163    20gb 10.62.70.173 platform1
denorm  1  p STARTED                 10.62.70.174 platform2
denorm  9  p STARTED 8929604    19gb 10.62.70.175 platform3

A sample program that reproduces this problem is available at https://filebin.net/nuaca339q8g101tu/BulkIndexer.java?t=5h5lxfmm
It is run with the following arguments:
java BulkIndexer testindex 40 10000 1000 60 10 1000000

We also observed heavy GC activity while running this program, which is hurting indexing performance.
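
In case it helps anyone reproducing this, the per-node heap, GC, and circuit-breaker figures can be watched during the run with the node stats API (a general diagnostic sketch using the standard 7.x endpoint, not output captured from our cluster):

# JVM (heap/GC) and circuit-breaker statistics for every node
curl 'http://localhost:9200/_nodes/stats/jvm,breaker?pretty'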

Why are the shards moving to the UNASSIGNED state, and why are they never recovered without intervention? And why are some shards in the STARTED state while their doc count/size information is missing? We are not able to index documents while any shard is in this state. How do we recover from these states?
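
For reference, these are the 7.x cluster APIs involved, shown as a sketch; the retry_failed flag, which retries allocations that exhausted their automatic retries, is an assumption on our part:

# Ask the cluster why the first unassigned shard it finds is not allocated
curl -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Retry shard allocations that previously failed their automatic retry limit
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'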

As a start, you may want to increase the heap size if you can; otherwise, try reducing the size of your bulk requests.
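
As a rough illustration of both options (the 8g figure is a placeholder, not a recommendation for this cluster):

# Heap is set via -Xms/-Xmx in config/jvm.options on each node, e.g.
#   -Xms8g
#   -Xmx8g
# Current heap headroom per node can be checked with:
curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max'

Reducing the bulk size just means sending fewer documents (or fewer bytes) per bulk request, and fewer concurrent bulks, from the indexer.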

We are running a benchmark before upgrading from ES 5.6.16 to ES 7.7.1. We observe that the same workload works fine with ES 5.6 in an identical environment. Can you give us some more insight into what is causing this degradation of performance in ES 7.7.1?

That's a huge version jump, so there are tonnes of changes in there.
Chances are there is something we implemented during that gap to reduce the risk of OOM.
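
If I had to guess, the most likely candidate is the real-memory parent circuit breaker added in 7.0: indices.breaker.total.use_real_memory defaults to true, so the parent breaker now trips on actual heap usage at 95% of the heap, which lines up with the "real usage" figure in your exception. You can inspect the effective settings with the cluster settings API (just a sketch of the call):

# Show breaker-related settings, including defaults
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep indices.breaker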
