"Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent]

Hi Guys,

I have been facing the issue below for a few days now:

"Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [24876638648/23.1gb], which is larger than the limit of [24481313587/22.7gb], real usage: [24864230864/23.1gb], new bytes reserved: [12407784/11.8mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=12407784/11.8mb, accounting=25462489/24.2mb]"

I ran into this issue earlier, so I searched the ES discussion forum and learned that I needed to scale up the memory. I was previously running with a 12 GB heap; I have since doubled it, but I am still hitting this issue.
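For anyone comparing numbers: the 22.7gb limit in the exception is the default parent breaker limit of 95% of the heap in 7.x (24g × 0.95 ≈ 22.8g), and the "real usage" figure shows the heap itself is essentially full, which is why adding heap alone has not made the errors go away. This is how I check the per-node breaker usage while the errors are happening (a minimal sketch, assuming the default HTTP port 9200 is reachable on the host):

```bash
# Show limit, estimated usage, and trip count for every circuit breaker
# on each node; the "parent" breaker is the one failing in the exception.
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'
```

In the output, the `parent` entry's `limit_size_in_bytes` and `estimated_size_in_bytes` correspond to the limit and "real usage" values in the exception message.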

Here is the configuration for Elasticsearch:

```yaml
version: '3.4'
services:
  elasticsearch:
    image: ${REGISTRY}/elastic/elasticsearch:7.3.2-431
    environment:
      - cluster.name=mgmt-elasticsearch-cluster
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=3
      - cluster.initial_master_nodes=msql07,msql08,msql09,msql10,msql11,msql12
      - SERVICE_NAME=elasticsearch
      - TAKE_FILE_OWNERSHIP=true
      - ES_JAVA_OPTS=-Xms24g -Xmx24g -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=75
      - HOSTNAME_COMMAND=curl -H Metadata:true -s http://169.254.169.254/metadata/instance?api-version=2019-06-04 | jq -r '.compute.name'
    labels:
      com.bnsf.mp.description: "ElasticSearch database"
      com.bnsf.mp.department: "XF"
    logging:
      driver: "json-file"
    networks:
      - logging
    volumes:
      - type: bind
        source: /opt/data/elasticsearch
        target: /usr/share/elasticsearch/data
      - type: tmpfs
        target: /usr/share/elasticsearch/logs
    deploy:
      labels:
        traefik.enable: "true"
        traefik.port: "9200"
        traefik.frontend.rule: "Host:kibana.xyz.com;PathPrefixStrip:/elasticsearch/"
        traefik.frontend.entryPoints: "https"
        traefik.docker.network: "logging"
      mode: global
      endpoint_mode: dnsrr
      placement:
        constraints: [node.labels.type == sql]
      resources:
        limits:
          cpus: '4.0'
          memory: 48G
        reservations:
          cpus: '2.0'
          memory: 24G
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 60s
        failure_action: rollback
        monitor: 180s
        max_failure_ratio: 0.25
```

Due to this circuit breaker issue, Kibana also dies after a few hours, and ES keeps unassigning shards, which turns the cluster status yellow.
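For completeness, this is how I list which shards went unassigned and why (a sketch; assumes the cluster is reachable on the default port):

```bash
# List shards with their state and the reason they became unassigned;
# the grep keeps the header row plus the UNASSIGNED entries.
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep -E 'state|UNASSIGNED'
```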

Attaching a few monitoring screenshots from Grafana.

jvm.options is below, but I set the heap and a few other configs through the environment (ES_JAVA_OPTS=-Xms24g -Xmx24g -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=75):

```
## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative.ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT
```
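Since ES_JAVA_OPTS is applied on top of jvm.options in the Docker image, here is a quick way to confirm which flags each node actually started with (again a sketch, assuming the default port):

```bash
# Print the JVM input arguments each node was launched with, to verify
# the heap size and G1GC flags from ES_JAVA_OPTS actually took effect.
curl -s 'http://localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.input_arguments&pretty'
```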

Can someone help me resolve this issue?

The logs from Elasticsearch:

```
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | {"type": "server", "timestamp": "2020-11-22T09:00:25,361+0000", "level": "DEBUG", "component": "o.e.a.s.TransportSearchAction", "cluster.name": "mgmt-elasticsearch-cluster", "node.name": "msql12", "cluster.uuid": "IOHuA3EiTqyG-inhZ2ZRcg", "node.id": "tR8aAZbYS-GV4Mj1a2-mCA", "message": "[mp-prod-event-moves-2020.11.22][0], node[v6QeDv1RT2G6f8CPB6p53g], [P], s[STARTED], a[id=b31RPy-oRKWLfSgeT3OQRw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[mp-prod-event-], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='1605876857124', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=6, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"size":0,"timeout":"30000ms","query":{"bool":{"must":[{"query_string":{"query":"CLRSTS","fields":[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"analyze_wildcard":true,"time_zone":"America/Chicago","escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},{"bool":{"should":[{"match_phrase":{"message":{"query":"\"requestStatus\":\"C\"","slop":0,"zero_terms_query":"NONE","boost":1.0}}},{"match_phrase":{"message":{"query":"\"requestStatus\":\"L\"","slop":0,"zero_terms_query":"NONE","boost":1.0}}}],"adjust_pure_negative":true,"minimum_should_match":"1","boost":1.0}},{"range":{"@timestamp":{"from":"2020-11-20T12:30:00.000Z","to":"2020-11-20T20:30:00.000Z","include_lower":true,"include_upper":true,"format":"strict_date_optional_time","boost":1.0}}}],"filter":[{"match_all":{"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"","docvalue_fields":[{"field":"@timestamp","format":"date_time"}],"script_fields":{},"track_total_hits":2147483647,"aggregations":{"2":{"filters":{"filters":{"Havre East DS Lined":{"bool":{"must":[{"query_string":{"query":"CLRSTS AND \"NOCDISP-NC002\" NOT \"REQUESTID\"","fields":,"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"Havre East MP Lined":{"bool":{"must":[{"query_string":{"query":"CLRSTS AND \"NOCDISP-NC002\" AND \"REQUESTID\"","fields":,"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}}},"other_bucket":false,"other_bucket_key":"other"}}}}}] lastShard [true]" ,
logging_elasticsearch.0.7l3wcaf4h0wk@msql10 | "at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [msql08][192.168.2.19:9300][indices:data/read/search[can_match]]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.b79xn1o5bdc9@msql08 | "at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at java.lang.Thread.run(Thread.java:835) [?:?]"] }
logging_elasticsearch.0.ylfypgoewi8z@msql07 | {"type": "server", "timestamp": "2020-11-23T04:52:40,732+0000", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "mgmt-elasticsearch-cluster", "node.name": "msql07", "cluster.uuid": "IOHuA3EiTqyG-inhZ2ZRcg", "node.id": "TPvt3m18TteOrNLZx4KKRA", "message": " [mp-uat1-infra-cassandra-2020.11.23][0] failed engine [primary shard [[mp-uat1-infra-cassandra-2020.11.23][0], node[TPvt3m18TteOrNLZx4KKRA], [P], s[STARTED], a[id=AJ7y7NPSQpCLCg23tqv6Tg]] was demoted while failing replica shard]" ,
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "stacktrace": ["org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [24765919806/23gb], which is larger than the limit of [24481313587/22.7gb], real usage: [24765912464/23gb], new bytes reserved: [7342/7.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=7342/7.1kb, accounting=36423318/34.7mb]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.7l3wcaf4h0wk@msql10 | "at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.7l3wcaf4h0wk@msql10 | "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [24746112742/23gb], which is larger than the limit of [24481313587/22.7gb], real usage: [24746110976/23gb], new bytes reserved: [1766/1.7kb], usages [request=0/0b, fielddata=675/675b, in_flight_requests=3530/3.4kb, accounting=36806685/35.1mb]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.b79xn1o5bdc9@msql08 | "at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.f29yl4yyqf1o@msql09 | "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.f29yl4yyqf1o@msql09 | "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.f29yl4yyqf1o@msql09 | "at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]",
logging_elasticsearch.0.ylfypgoewi8z@msql07 | "at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ecr8s2xr26c3@msql11 | "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [msql08][192.168.2.19:9300][indices:data/read/search[phase/query]]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) [elasticsearch-7.3.2.jar:7.3.2]",
logging_elasticsearch.0.ta3g4ahn6y27@msql12 | "at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.2.jar:7.3.2]",
error from daemon in stream: Error grabbing logs: rpc error: code = Unavailable desc = transport is closing
```

Welcome to our community! :smiley: We aren't all guys though.

Please format your code/logs/config using the </> button, or markdown-style backticks. It helps to make things easy to read, which helps us help you :slight_smile:

Hey Mark,

I tried formatting it using the </> button. I don't see any changes.

Can you please help me resolve this issue?

I am now using the following JVM parameters:

```
ES_JAVA_OPTS=-Xms27g -Xmx27g -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30
```

The number of circuit breaker exceptions has dropped over the last 12 hours, and shards are no longer becoming unassigned.

However, the Kibana instances still died once due to this exception.
Please suggest what I am missing here.
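In case it helps, this is how I inspect the effective breaker limits, along with a transient override I have been considering (a sketch only; the value is illustrative, lowering it makes the breaker trip earlier to protect the node, while raising it above the 95% default mostly trades breaker errors for OOM risk rather than fixing the underlying memory pressure):

```bash
# Show the effective circuit-breaker settings (indices.breaker.total.limit
# defaults to 95% of heap when the real-memory breaker is enabled)
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=defaults.indices.breaker.*&pretty'

# Transient override of the parent breaker limit -- illustrative value only
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"indices.breaker.total.limit":"90%"}}'
```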

Thanks
Uttam

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.