Hi,
I am running Elasticsearch 7.8.0 in a Kubernetes (k8s) environment with 3 master nodes, 2 data nodes, and 3 ingest nodes. I am getting a circuit breaker exception on one of the data nodes.
The details are below:
- java -version:
openjdk version "11.0.7" 2020-04-14 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.7+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.7+10-LTS, mixed mode)
- OS: CentOS Linux release 7.8.2003 (Core)
Error logs:
{"type":"log","host":"data-0","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","time": "2020-12-21T16:19:45.261Z","logger":"o.e.i.c.IndicesClusterStateService","timezone":"UTC","marker":"[data-0] ","log":{"message":"[fluentd-ncms-log-2020.12.21][0] marking and sending shard failed due to [failed recovery]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [fluentd-ncms-log-2020.12.21][0]: Recovery failed from {data-1}{MCKwMFFeR1SvPeChiHhbbA}{sAzv5YAjS1OrC3Pam6bc5A}{192.168.2.78}{192.168.2.78:9300}{d} into {data-0}{Fu0QhXwWQjuSxbSfBuHrzg}{-Ra2as3UTGel3FzhFpWERg}{192.168.253.69}{192.168.253.69:9300}{d}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.8.0.jar:7.8.0]
at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:265) [search-guard-suite-security-7.8.0-43.0.0-146.jar:7.8.0-43.0.0-146]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1173) [elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:235) [elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-1][192.168.2.78:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [1029154666/981.4mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1029153896/981.4mb], new bytes reserved: [770/770b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=751599506/716.7mb, accounting=155536/151.8kb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:347) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:210) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:73) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
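As far as I can tell, the limit in the exception (972.7mb) is the default parent breaker, indices.breaker.total.limit, which is 95% of my 1 GB heap when indices.breaker.total.use_real_memory is enabled (the 7.x default). Real heap usage (981.4mb) was already above that limit when the small (770-byte) recovery message arrived, so the request tripped the breaker. For reference, this is how I have been inspecting the breaker and heap state on the node (a minimal sketch; with Search Guard enabled, https://, -k, and credentials may be required):

```
# Check circuit breaker stats on the affected node (data-0):
curl -s 'http://localhost:9200/_nodes/data-0/stats/breaker?pretty'

# Quick view of heap pressure across all nodes:
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
```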
Elasticsearch process:
elastic+ 70 14 1 2020 ? 09:53:16 /etc/alternatives/jre_openjdk//bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-10624982066029153247 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms1g -Xmx1g -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
JVM options:
11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30
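These G1 settings match the Elasticsearch defaults for JDK 11. To confirm the effective circuit breaker settings on the cluster, I check the defaults like this (a sketch, with the same access caveats as above):

```
# Dump the effective breaker settings, including built-in defaults:
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true' \
  | grep indices.breaker
```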
The data pod's memory limits, requests, and JVM configuration are as follows:
Limits: cpu: 1, memory: 2Gi
Requests: cpu: 100m, memory: 1Gi
ES_JAVA_OPTS: -Xms1g -Xmx1g
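If the fix is to give the node more heap, I assume I would need to raise the heap and the pod memory together, keeping the heap at roughly half the container memory. A hypothetical sketch of what I have in mind (the StatefulSet name "data" and container index 0 are placeholders for my actual manifest):

```
# Hypothetical: double the heap and the pod memory together.
kubectl set env statefulset/data ES_JAVA_OPTS="-Xms2g -Xmx2g"
kubectl patch statefulset data --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "4Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "4Gi"}
]'
```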
The node/stats output can be seen here.
Please help me resolve this issue.