CircuitBreakingException: [parent] Data too large

Hi,
I am using Elasticsearch 7.8.0 in a k8s environment. I have 3 master, 2 data and 3 ingest nodes. I am getting a circuit breaker exception on one of the data nodes.
Below are the details:

  1. java -version
    openjdk version "11.0.7" 2020-04-14 LTS
    OpenJDK Runtime Environment 18.9 (build 11.0.7+10-LTS)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.7+10-LTS, mixed mode)
  2. CentOS Linux release 7.8.2003 (Core)

Error logs:

{"type":"log","host":"data-0","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","time": "2020-12-21T16:19:45.261Z","logger":"o.e.i.c.IndicesClusterStateService","timezone":"UTC","marker":"[data-0] ","log":{"message":"[fluentd-ncms-log-2020.12.21][0] marking and sending shard failed due to [failed recovery]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [fluentd-ncms-log-2020.12.21][0]: Recovery failed from {data-1}{MCKwMFFeR1SvPeChiHhbbA}{sAzv5YAjS1OrC3Pam6bc5A}{192.168.2.78}{192.168.2.78:9300}{d} into {data-0}{Fu0QhXwWQjuSxbSfBuHrzg}{-Ra2as3UTGel3FzhFpWERg}{192.168.253.69}{192.168.253.69:9300}{d}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.8.0.jar:7.8.0]
        at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:265) [search-guard-suite-security-7.8.0-43.0.0-146.jar:7.8.0-43.0.0-146]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1173) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:235) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-1][192.168.2.78:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [1029154666/981.4mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1029153896/981.4mb], new bytes reserved: [770/770b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=751599506/716.7mb, accounting=155536/151.8kb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:347) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:210) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:73) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]
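The figures in the breaker message above are internally consistent and can be cross-checked against the 1 GiB heap. A minimal sketch, assuming the 7.x default parent breaker limit of 95% of the heap (with `indices.breaker.total.use_real_memory` enabled):

```python
# Cross-check the numbers from the CircuitBreakingException above.
# Assumption: indices.breaker.total.limit is at its 7.x default of 95% of heap.
heap_bytes = 1 * 1024 ** 3        # -Xmx1g
parent_limit = int(heap_bytes * 0.95)
print(parent_limit)               # 1020054732 -> "limit of [1020054732/972.7mb]"

real_usage = 1029153896           # "real usage" in the log
new_bytes = 770                   # "new bytes reserved" in the log
print(real_usage + new_bytes)     # 1029154666 -> "would be [1029154666/981.4mb]"
```

So the breaker tripped because real heap usage was already just above 95% of the heap when a tiny 770-byte transport message arrived; the recovery request itself was not the problem.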

Elasticsearch process:

elastic+    70    14  1  2020 ?        09:53:16 /etc/alternatives/jre_openjdk//bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-10624982066029153247 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms1g -Xmx1g -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
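One detail worth noting in the process line: `-XX:MaxDirectMemorySize=536870912` appears to be the Elasticsearch default of half the heap (set by the startup script's JVM ergonomics since 7.0, as far as I know):

```python
# Assumption: MaxDirectMemorySize defaults to half of -Xmx in ES 7.x.
heap_bytes = 1 * 1024 ** 3   # -Xms1g -Xmx1g
print(heap_bytes // 2)       # 536870912 -> -XX:MaxDirectMemorySize=536870912
```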

JVM option:

11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30

The data pod's memory limits, requests, and JVM configuration are as follows:

Limits:
  cpu:     1
  memory:  2Gi
Requests:
  cpu:      100m
  memory:   1Gi

ES_JAVA_OPTS: -Xms1g -Xmx1g

node/stats output can be seen here
Please help me resolve this.

You have very small data nodes with low limits, but also very little data, so I do not see any reason why the circuit breaker would trigger. I do however see that you are using SearchGuard; I have no experience with it and do not know how well it manages memory in small environments like this. I would recommend deploying the default distribution and trying the built-in security that is available through the free Basic tier to see if this makes any difference. That would give an indication of whether this is a general problem or SearchGuard-specific.


You might have run into the issue fixed by https://github.com/elastic/elasticsearch/pull/58674; I would recommend upgrading to a version that includes that fix (7.9+).


Hi,
I want to know what marking and sending shard failed due to [failed recovery] means, and when it is triggered.

{"logger":"o.e.i.c.IndicesClusterStateService","timezone":"UTC","marker":"[data-0] ","log":{"message":"[fluentd-ncms-log-2020.12.21][0] marking and sending shard failed due to [failed recovery]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [fluentd-ncms-log-2020.12.21][0]: Recovery failed from {data-1}{MCKwMFFeR1SvPeChiHhbbA}{sAzv5YAjS1OrC3Pam6bc5A}{192.168.2.78}{192.168.2.78:9300}{d} into {data-0}{Fu0QhXwWQjuSxbSfBuHrzg}{-Ra2as3UTGel3FzhFpWERg}{192.168.253.69}{192.168.253.69:9300}{d}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.8.0.jar:7.8.0]
        at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:265) [search-guard-suite-security-7.8.0-43.0.0-146.jar:7.8.0-43.0.0-146]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1173) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:235) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-1][192.168.2.78:9300][internal:index/shard/recovery/start_recovery]

Hi @chaitra_hegde,

marking and sending shard failed due to [failed recovery] is really two parts:

marking and sending shard failed: this happens when a shard fails for some reason. Elasticsearch notifies the master node of the failure, and the master node takes appropriate action (typically allocating the shard somewhere else).

[failed recovery]: this indicates that the shard failed while it was recovering (initializing). A peer recovery initializes a replica shard by copying data over from the primary shard.

The particular trigger here was the circuit breaker exception. This fires when Elasticsearch thinks too much memory is in use, either by a specific subsystem or overall. In this case it was the parent breaker that triggered, which tracks overall memory use. The fix mentioned above could help here, or it could be legitimate memory overuse. See also this comment for a deeper breakdown of the circuit breaker message.
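The "usages [...]" section of the breaker message lists the per-child-breaker contributions. As a quick illustration (the string below is copied from the log; the parsing helper itself is just a hypothetical sketch, not an Elasticsearch API):

```python
# Break down the "usages [...]" section of the breaker message from the log.
usages = ("request=0/0b, fielddata=0/0b, "
          "in_flight_requests=751599506/716.7mb, accounting=155536/151.8kb")

breakdown = {}
for part in usages.split(", "):
    name, value = part.split("=")
    breakdown[name] = int(value.split("/")[0])   # bytes before the "/"

# in_flight_requests (inbound transport messages, e.g. recovery traffic)
# accounts for almost all of the tracked child-breaker usage here.
print(max(breakdown, key=breakdown.get))   # in_flight_requests
print(breakdown["in_flight_requests"])     # 751599506
```

Note that with real-memory accounting the parent breaker trips on actual heap usage (981.4mb here), which can exceed the sum of the child breakers; the breakdown only shows what the child breakers were tracking at the time.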