CircuitBreakingException: [parent] Data too large

Hi,
I am using Elasticsearch 7.8.0 in a k8s environment. I have 3 master, 2 data and 3 ingest nodes. I am getting a circuit breaker exception on one of the data nodes.
Below are the details:

  1. java -version
    openjdk version "11.0.7" 2020-04-14 LTS
    OpenJDK Runtime Environment 18.9 (build 11.0.7+10-LTS)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.7+10-LTS, mixed mode)
  2. CentOS Linux release 7.8.2003 (Core)

Error logs:

{"type":"log","host":"data-0","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","time": "2020-12-21T16:19:45.261Z","logger":"o.e.i.c.IndicesClusterStateService","timezone":"UTC","marker":"[data-0] ","log":{"message":"[fluentd-ncms-log-2020.12.21][0] marking and sending shard failed due to [failed recovery]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [fluentd-ncms-log-2020.12.21][0]: Recovery failed from {data-1}{MCKwMFFeR1SvPeChiHhbbA}{sAzv5YAjS1OrC3Pam6bc5A}{192.168.2.78}{192.168.2.78:9300}{d} into {data-0}{Fu0QhXwWQjuSxbSfBuHrzg}{-Ra2as3UTGel3FzhFpWERg}{192.168.253.69}{192.168.253.69:9300}{d}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.8.0.jar:7.8.0]
        at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:265) [search-guard-suite-security-7.8.0-43.0.0-146.jar:7.8.0-43.0.0-146]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1173) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:235) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-1][192.168.2.78:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [1029154666/981.4mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1029153896/981.4mb], new bytes reserved: [770/770b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=751599506/716.7mb, accounting=155536/151.8kb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:347) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:210) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:73) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]
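The figures in the breaker message above are internally consistent and can be cross-checked against the 1 GiB heap. A minimal sketch, assuming the 7.x default parent breaker limit of 95% of the heap (with `indices.breaker.total.use_real_memory` enabled):

```python
# Cross-check the numbers from the CircuitBreakingException above.
# Assumption: indices.breaker.total.limit is at its 7.x default of 95% of heap.
heap_bytes = 1 * 1024 ** 3        # -Xmx1g
parent_limit = int(heap_bytes * 0.95)
print(parent_limit)               # 1020054732 -> "limit of [1020054732/972.7mb]"

real_usage = 1029153896           # "real usage" in the log
new_bytes = 770                   # "new bytes reserved" in the log
print(real_usage + new_bytes)     # 1029154666 -> "would be [1029154666/981.4mb]"
```

So the breaker tripped because real heap usage was already just above 95% of the heap when a tiny 770-byte transport message arrived; the recovery request itself was not the problem.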

Elasticsearch process:

elastic+    70    14  1  2020 ?        09:53:16 /etc/alternatives/jre_openjdk//bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-10624982066029153247 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms1g -Xmx1g -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
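One detail worth noting in the process line: `-XX:MaxDirectMemorySize=536870912` appears to be the Elasticsearch default of half the heap (set by the startup script's JVM ergonomics since 7.0, as far as I know):

```python
# Assumption: MaxDirectMemorySize defaults to half of -Xmx in ES 7.x.
heap_bytes = 1 * 1024 ** 3   # -Xms1g -Xmx1g
print(heap_bytes // 2)       # 536870912 -> -XX:MaxDirectMemorySize=536870912
```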

JVM option:

11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30

The data pod's memory limits, requests, and JVM configuration are as follows:

Limits:
  cpu:     1
  memory:  2Gi
Requests:
  cpu:      100m
  memory:   1Gi

ES_JAVA_OPTS: -Xms1g -Xmx1g

node/stats output can be seen here
Please help me resolve this.

You have very small data nodes with low limits, but also very little data, so I do not see any reason why the circuit breaker would trigger. I do however see that you are using SearchGuard; I have no experience with it and do not know how well it manages memory in small environments like this. I would recommend deploying the default distribution and trying the built-in security that is available through the free Basic tier to see if this makes any difference. That would give an indication of whether this is a general problem or SearchGuard-specific.


You might have run into the issue fixed by https://github.com/elastic/elasticsearch/pull/58674; I would recommend upgrading to a version that includes that fix (7.9+).


Hi,
I want to know what marking and sending shard failed due to [failed recovery] means, and when it is triggered.

{"logger":"o.e.i.c.IndicesClusterStateService","timezone":"UTC","marker":"[data-0] ","log":{"message":"[fluentd-ncms-log-2020.12.21][0] marking and sending shard failed due to [failed recovery]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [fluentd-ncms-log-2020.12.21][0]: Recovery failed from {data-1}{MCKwMFFeR1SvPeChiHhbbA}{sAzv5YAjS1OrC3Pam6bc5A}{192.168.2.78}{192.168.2.78:9300}{d} into {data-0}{Fu0QhXwWQjuSxbSfBuHrzg}{-Ra2as3UTGel3FzhFpWERg}{192.168.253.69}{192.168.253.69:9300}{d}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.8.0.jar:7.8.0]
        at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:265) [search-guard-suite-security-7.8.0-43.0.0-146.jar:7.8.0-43.0.0-146]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1173) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:235) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-1][192.168.2.78:9300][internal:index/shard/recovery/start_recovery]

Hi @chaitra_hegde,

marking and sending shard failed due to [failed recovery] is really two parts:

marking and sending shard failed: this happens when a shard fails for some reason. Elasticsearch notifies the master node of the failure, and the master node takes appropriate action (typically allocating the shard somewhere else).

[failed recovery]: this indicates that the shard failed while it was recovering (initializing). A peer recovery initializes a replica shard by copying data over from the primary shard.

The particular trigger here was the circuit breaker exception. This fires when Elasticsearch thinks too much memory is in use, either by a specific subsystem or overall. In this case it was the parent breaker that triggered, which tracks overall memory use. The fix mentioned above could help here, or it could be legitimate memory overuse. See also this comment for a deeper breakdown of the circuit breaker message.
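The "usages [...]" section of the breaker message lists the per-child-breaker contributions. As a quick illustration (the string below is copied from the log; the parsing helper itself is just a hypothetical sketch, not an Elasticsearch API):

```python
# Break down the "usages [...]" section of the breaker message from the log.
usages = ("request=0/0b, fielddata=0/0b, "
          "in_flight_requests=751599506/716.7mb, accounting=155536/151.8kb")

breakdown = {}
for part in usages.split(", "):
    name, value = part.split("=")
    breakdown[name] = int(value.split("/")[0])   # bytes before the "/"

# in_flight_requests (inbound transport messages, e.g. recovery traffic)
# accounts for almost all of the tracked child-breaker usage here.
print(max(breakdown, key=breakdown.get))   # in_flight_requests
print(breakdown["in_flight_requests"])     # 751599506
```

Note that with real-memory accounting the parent breaker trips on actual heap usage (981.4mb here), which can exceed the sum of the child breakers; the breakdown only shows what the child breakers were tracking at the time.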