CircuitBreakingException Data too large

Barak · June 21, 2020, 8:08am

Hi,

I have an ES stack running on AWS spot instances, so shard reallocation happens quite frequently.
Occasionally I receive the following error message and then the shard remains unassigned without further allocation attempts:

nested: RemoteTransportException[[ip-172-30-2-197.ec2.internal][172.30.2.197:9300][internal:index/shard/recovery/filesInfo]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [32428797638/30.2gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32428793232/30.2gb], new bytes reserved: [4406/4.3kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=4406/4.3kb, accounting=708708/692kb]]; ","allocation_status":"no_attempt"}}
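The numbers in that message can be checked against each other. The sketch below is only a reading aid: it parses the byte counts from the text above and compares them with the 31 GB heap from the JVM settings in this post. The 95% figure it recovers appears to match the default `indices.breaker.total.limit` when real-memory tracking is enabled, but treat that as an assumption to verify against your own cluster settings. The function name `parse_breaker_message` is made up for this example.

```python
import re

# Message text copied from the error above (parent breaker portion only).
MESSAGE = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[32428797638/30.2gb], which is larger than the limit of "
    "[31621696716/29.4gb], real usage: [32428793232/30.2gb], "
    "new bytes reserved: [4406/4.3kb]"
)


def parse_breaker_message(message):
    """Extract the raw byte counts from a parent-breaker 'Data too large' message."""
    patterns = {
        "would_be": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes": r"new bytes reserved: \[(\d+)/",
    }
    return {name: int(re.search(p, message).group(1)) for name, p in patterns.items()}


fields = parse_breaker_message(MESSAGE)
heap_bytes = 31 * 1024**3  # -Xmx31g from the JVM settings in this post

# "would be" is simply current real usage plus the bytes being reserved.
adds_up = fields["real_usage"] + fields["new_bytes"] == fields["would_be"]

# Express the limit and the real usage as fractions of the configured heap.
limit_fraction = fields["limit"] / heap_bytes
usage_fraction = fields["real_usage"] / heap_bytes

print(adds_up)
print(f"limit is {limit_fraction:.2%} of heap")
print(f"real usage is {usage_fraction:.2%} of heap")
```

Read this way, the 4.3kb recovery request is not what is large. The heap was already above the parent breaker's limit when the request arrived, so any incoming transport request would have been rejected at that moment.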

How can I avoid these errors?
Do these errors count toward the number of allocation retries? I ask because I have set max_retries to 20.
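For anyone looking into the retry question, here is a minimal sketch of the two cluster API calls that are usually relevant: the allocation explain API, which reports why a shard copy is unassigned, and a reroute with `retry_failed=true`, which asks Elasticsearch to retry shards that have exhausted their allocation attempts. It assumes `max_retries` above refers to `index.allocation.max_retries`. The cluster address, the index name `my-index`, and the helper names are placeholders, not values from this cluster. The block only builds the requests; nothing is sent until `send` is called.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def explain_request(index, shard, primary):
    """Build a cluster allocation explain request for one shard copy."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        ES_URL + "/_cluster/allocation/explain",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retry_failed_request():
    """Build a reroute request that retries shards whose allocation attempts failed."""
    return urllib.request.Request(
        ES_URL + "/_cluster/reroute?retry_failed=true",
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical sequence would be `send(explain_request("my-index", 0, False))` to read the reason for a stuck replica, followed by `send(retry_failed_request())` once heap pressure on the target node has dropped.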

Thanks!

Cluster info:
Version 7.7.1
9 data nodes
1 large index of 100 million docs (196 GB) - 21 shards (7 primary with 2 replicas)
The other indices are very small

JVM settings:

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms31g
-Xmx31g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging

#8:-XX:+PrintGCDetails
#8:-XX:+PrintGCDateStamps
#8:-XX:+PrintTenuringDistribution
#8:-XX:+PrintGCApplicationStoppedTime
#8:-Xloggc:/var/log/elasticsearch/gc.log
#8:-XX:+UseGCLogFileRotation
#8:-XX:NumberOfGCLogFiles=32
#8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
#9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
#9-:-Djava.locale.providers=COMPAT

# temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
10-:-XX:UseAVX=2

jerrac · July 13, 2020, 5:22pm

I've been experiencing the same issue off and on. My solution has been to increase how much RAM Java can use, but that's not exactly ideal.

Is there a way to set the upper limit on how large an individual shard can be?

I'm off to research that.

Barak · July 16, 2020, 10:33am

I doubt it. As the index grows, your only option for reducing the shard size will be re-sharding.

Steve_Mushero · July 17, 2020, 3:33am

The only way to limit shard size is to add more shards, which means reindexing. How big are the shards? It looks like 196 GB / 7 = 28 GB each, which is reasonable, but given everything else going on in the cluster, they must be too large to move around. It's not clear which breaker is tripping, since the usages listed are small. How many indices and shards are in the cluster? That can eat RAM if you have thousands.
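The shard-size estimate depends on whether the 196 GB figure counts primaries only or all copies, which the original post does not say. A small sketch of both readings, using the 7 primaries and 2 replicas from the cluster info (the function name and its parameters are made up for illustration):

```python
def shard_size_gb(index_size_gb, primaries, replicas_per_primary, size_includes_replicas):
    """Estimate the average size of one shard copy, in GB.

    If the reported index size covers primaries only, divide by the number
    of primaries; if it covers every copy, divide by the total shard count.
    """
    if size_includes_replicas:
        copies = primaries * (1 + replicas_per_primary)
    else:
        copies = primaries
    return index_size_gb / copies


# Reading 1: 196 GB is the primary store size.
primaries_only = shard_size_gb(196, 7, 2, size_includes_replicas=False)

# Reading 2: 196 GB is the total across all 21 shard copies.
all_copies = shard_size_gb(196, 7, 2, size_includes_replicas=True)

print(f"primaries only: {primaries_only:.1f} GB per shard")
print(f"all copies:     {all_copies:.1f} GB per shard")
```

Either way the shards fall well inside commonly recommended sizes, which is consistent with the point above that shard size alone does not obviously explain the breaker tripping.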

