CircuitBreakingException: [parent] Data too large in ES 7.x

Since upgrading to ES 7.x, the cluster keeps hitting CircuitBreakingException, especially while running recovery tasks or indexing large amounts of data, on actions such as [internal:index/shard/recovery/start_recovery] or [cluster:monitor/nodes/info[n]]. The affected node then leaves the cluster.
Here are the log and node stats.
After I disabled indices.breaker.total.use_real_memory, the exception seems not to appear again.
Is this question related to this issue?

Yes, the linked issue is related. We're looking into the conditions under which the breaker might trip even though the node could theoretically handle the extra load. This seems to be mostly related to the workload. In your case, it is best to disable the real memory breaker.

It happened again even with the real memory breaker disabled: [parent] Data too large, data for [<http_request>].
It looks like the real memory breaker isn't the root cause.

Can you provide the full message? It includes information about the different child breakers, which helps explain where memory is being used.

ElasticsearchStatusException[Elasticsearch exception [type=circuit_breaking_exception, reason=[parent] Data too large, data for [<http_request>] would be [30799676956/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30760015112/28.6gb], new bytes reserved: [39661844/37.8mb]]]
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
    at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2053)
    at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2030)
    at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1947)
    at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:857)
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:560)
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:537)
    at shaded.org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
    at shaded.org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
    at shaded.org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:412)
    at shaded.org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:305)
    at shaded.org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:267)
    at shaded.org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
    at shaded.org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
    at shaded.org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:116)
    at shaded.org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:164)
    at shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:339)
    at shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:317)
    at shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:278)
    at shaded.org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:106)
    at shaded.org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:590)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://node:9200], URI [/_bulk?timeout=3m], status line [HTTP/1.1 429 Too Many Requests]
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [30799676956/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30760015112/28.6gb], new bytes reserved: [39661844/37.8mb]","bytes_wanted":30799676956,"bytes_limit":30601641984,"durability":"TRANSIENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [30799676956/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30760015112/28.6gb], new bytes reserved: [39661844/37.8mb]","bytes_wanted":30799676956,"bytes_limit":30601641984,"durability":"TRANSIENT"},"status":429}
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:552)
        ... 16 more

Here are the node stats: https://del.dog/ibaruginif

The error shows that you're still using the real memory circuit breaker (see real usage: [30760015112/28.6gb]), even though you say you have disabled it?
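For anyone following along, the numbers in the error message can be cross-checked with a little arithmetic. This is only a sketch and not something posted in the thread: it assumes the node has a 30 GiB heap (an inference, the heap size is never stated here) and relies on the documented defaults for indices.breaker.total.limit, which are 95% of the heap when use_real_memory is true and 70% otherwise.

```python
# Values copied from the circuit_breaking_exception quoted above.
real_usage = 30_760_015_112  # "real usage"
new_bytes = 39_661_844       # "new bytes reserved"
would_be = 30_799_676_956    # "would be"
limit = 30_601_641_984       # "limit of"

# With the real memory breaker, the parent breaker adds the new
# reservation to the heap usage currently reported by the JVM.
assert real_usage + new_bytes == would_be

GIB = 1024 ** 3
assumed_heap = 30 * GIB  # ASSUMPTION: heap size is not stated in the thread

ratio = limit / assumed_heap
overshoot_mib = (would_be - limit) / 1024 ** 2

print(f"limit is {ratio:.0%} of a {assumed_heap // GIB} GiB heap")
print(f"request would exceed the limit by {overshoot_mib:.1f} MiB")
```

If the heap assumption holds, the limit works out to 95% of the heap, which is the default parent limit when the real memory breaker is enabled. That would be consistent with the node that rejected the request still running with use_real_memory set to true.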

I confirm that I have disabled the real memory circuit breaker:

GET problem_node:9200/_cluster/settings?include_defaults&flat_settings&local&filter_path=defaults.indices*
{
"defaults": {
"indices.analysis.hunspell.dictionary.ignore_case": "false",
"indices.analysis.hunspell.dictionary.lazy": "false",
"indices.breaker.accounting.limit": "100%",
"indices.breaker.accounting.overhead": "1.0",
"indices.breaker.fielddata.limit": "40%",
"indices.breaker.fielddata.overhead": "1.03",
"indices.breaker.fielddata.type": "memory",
"indices.breaker.request.limit": "60%",
"indices.breaker.request.overhead": "1.0",
"indices.breaker.request.type": "memory",
"indices.breaker.total.limit": "70%",
"indices.breaker.total.use_real_memory": "false",
"indices.breaker.type": "hierarchy",
"indices.cache.cleanup_interval": "1m",
"indices.fielddata.cache.size": "-1b",
"indices.lifecycle.poll_interval": "10m",
"indices.mapping.dynamic_timeout": "30s",
"indices.memory.index_buffer_size": "20%",
"indices.memory.interval": "5s",
"indices.memory.max_index_buffer_size": "6g",
"indices.memory.min_index_buffer_size": "48mb",
"indices.memory.shard_inactive_time": "5m",
"indices.queries.cache.all_segments": "false",
"indices.queries.cache.count": "10000",
"indices.queries.cache.size": "10%",
"indices.query.bool.max_clause_count": "1024",
"indices.query.query_string.allowLeadingWildcard": "true",
"indices.query.query_string.analyze_wildcard": "false",
"indices.recovery.internal_action_long_timeout": "1800000ms",
"indices.recovery.internal_action_timeout": "15m",
"indices.recovery.max_bytes_per_sec": "1024m",
"indices.recovery.max_concurrent_file_chunks": "2",
"indices.recovery.recovery_activity_timeout": "1800000ms",
"indices.recovery.retry_delay_network": "5s",
"indices.recovery.retry_delay_state_sync": "500ms",
"indices.requests.cache.expire": "0ms",
"indices.requests.cache.size": "1%",
"indices.store.delete.shard.timeout": "30s"
}
}

How did you disable the real memory circuit breaker? Did you put indices.breaker.total.use_real_memory: false into the elasticsearch.yml of all the nodes and restart them?

Also, why are you showing the defaults in the settings API call? The default for indices.breaker.total.use_real_memory should be true. The setting needs to be explicitly disabled.

I have disabled the real memory circuit breaker on all data nodes, but not on the master-only nodes (because the master nodes do not get this exception).
The default value of indices.breaker.total.use_real_memory shows false because the setting is set in elasticsearch.yml.
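Since the setting was applied to only some of the nodes, it may help to list which nodes carry it explicitly. The sketch below is a hedged illustration, not part of the original posts: it assumes the nodes info API (GET _nodes/settings?flat_settings=true) returns a "nodes" object whose entries each have a "name" and a flat "settings" map, and the node names in the sample are made up.

```python
import json
import urllib.request

SETTING = "indices.breaker.total.use_real_memory"


def explicit_real_memory_setting(nodes_info: dict) -> dict:
    """Map node name -> explicitly configured value (None if not set)."""
    result = {}
    for node in nodes_info.get("nodes", {}).values():
        result[node.get("name")] = node.get("settings", {}).get(SETTING)
    return result


def fetch_nodes_info(base_url: str) -> dict:
    """Fetch node settings; base_url is a placeholder like http://localhost:9200."""
    url = f"{base_url}/_nodes/settings?flat_settings=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Hand-written sample in the assumed response shape (not real output).
sample = {
    "nodes": {
        "id-1": {"name": "data-1", "settings": {SETTING: "false"}},
        "id-2": {"name": "master-1", "settings": {}},
    }
}

report = explicit_real_memory_setting(sample)
for name, value in report.items():
    print(f"{name}: {value if value is not None else 'not set (default applies)'}")
```

A node reporting no explicit value falls back to the default, so any such node that receives the bulk request can still trip the real memory breaker.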

Hi @LoadingZhang,

are you using the default CMS GC or did you switch to G1 GC?

Yes, I have been using G1GC since ES 5.x.

Hi @LoadingZhang,

if you can spare the time to try it out, it would be good to check whether re-enabling the real memory circuit breaker works if you change jvm.options to have:

10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30

instead of:

10-:-XX:InitiatingHeapOccupancyPercent=75

I would be very interested in knowing the outcome.
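To put the suggested values in perspective, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions: the 30 GiB heap is assumed rather than taken from the thread, the 95% figure is the default parent breaker limit with the real memory breaker enabled, and G1 only treats these percentages as starting points (region sizing and adaptive tuning change the exact numbers).

```python
GIB = 1024 ** 3
heap = 30 * GIB  # ASSUMPTION: heap size is not stated in the thread


def pct_of_heap(percent: float) -> float:
    """Return the given percentage of the assumed heap, in bytes."""
    return heap * percent / 100


# InitiatingHeapOccupancyPercent: occupancy at which G1 starts a
# concurrent marking cycle.
ihop_old = pct_of_heap(75)
ihop_new = pct_of_heap(30)

# G1ReservePercent: share of the heap G1 tries to keep free as a reserve.
reserve = pct_of_heap(25)

# Default parent breaker limit when use_real_memory is true.
breaker_limit = pct_of_heap(95)

for label, value in [
    ("marking starts (IHOP=75)", ihop_old),
    ("marking starts (IHOP=30)", ihop_new),
    ("G1 reserve (25%)", reserve),
    ("parent breaker limit (95%)", breaker_limit),
]:
    print(f"{label:28s} {value / GIB:5.1f} GiB")

print(f"headroom before the breaker, IHOP=75: {(breaker_limit - ihop_old) / GIB:.1f} GiB")
print(f"headroom before the breaker, IHOP=30: {(breaker_limit - ihop_new) / GIB:.1f} GiB")
```

The intent of the change, as far as this arithmetic shows, is to start collecting much earlier so that heap occupancy has more room to drop before it reaches the breaker limit.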

The nodes have not hit a CircuitBreakingException in the last 24 hours. I will take more time to test it, thanks.
BTW, I'm testing ZGC in the meantime, and it works well with the real memory circuit breaker disabled.
I guess -XX:SoftMaxHeapSize in JDK 13 would help with enabling the real memory circuit breaker.

Hi @LoadingZhang,

thanks for reporting back on this. Unfortunately, I made a mistake in my original post in that InitiatingHeapOccupancyPercent should really have been set to 30. I have edited my post above to avoid confusion if others read this post.

The JVM should auto-tune this parameter after a while, it only uses the original IHOP value until it has a better estimate of what it should be itself. So your test is certainly still valuable, confirming that the G1ReservePercent does reserve enough heap to avoid circuit breaking in your case.

I have not looked too much into ZGC yet, since it still has experimental status. You will need something like the SoftMaxHeapSize option to make it compatible with the real memory circuit breaker. Also, note that ZGC does not support compressed oops, meaning you will likely need more heap, since all references need 64 bits rather than 32 bits. This will in itself lead to some performance degradation (reduced CPU cache efficiency and more data to fetch from RAM).

The CircuitBreakingException appeared again, which is bad news.
But the exception is not that frequent, anyway.

Hi @LoadingZhang,

circuit breaking exceptions can occur for legitimate reasons too. Was the cluster/node heavily loaded at the time? In what situation did it occur (recovery, indexing, search, etc.)?

I hope you can share your ES and GC log files (feel free to PM them to me).

Yes, the cluster was indexing a large amount of data, but it is fine when the real memory circuit breaker is disabled.
I have PMed the logs to you; if you need more logs, I will send them.

@HenningAndersen We have the same issue with Zing JDK and ES 7.3.2 (latest).
JVM version (java -version):

java version "11.0.3.0.101" 2019-07-24 LTS
Zing Runtime Environment for Java Applications 19.07.0.0+3 (product build 11.0.3.0.101+12-LTS) Zing 64-Bit Tiered VM 19.07.0.0+3 (product build 11.0.3.0.101-zing_19.07.0.0-b4-product-azlinuxM-X86_64, mixed mode)

ES log:

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [49671046666/46.2gb], which is larger than the limit of [47173546803/43.9gb],
real usage: [49671045120/46.2gb], new bytes reserved: [1546/1.5kb], usages [request=0/0b, fielddata=8478/8.2kb, in_flight_requests=1546/1.5kb, accounting=7745839/7.3mb]
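As mentioned earlier in the thread, the full message lists the child breakers, so it is possible to compare what they track against the real heap usage. The following is a minimal sketch (not from the original post) that parses the message above with regular expressions; the parsing approach is just one way to do it.

```python
import re

# Message copied from the log excerpt above, joined into one string.
message = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[49671046666/46.2gb], which is larger than the limit of "
    "[47173546803/43.9gb], real usage: [49671045120/46.2gb], "
    "new bytes reserved: [1546/1.5kb], usages [request=0/0b, "
    "fielddata=8478/8.2kb, in_flight_requests=1546/1.5kb, "
    "accounting=7745839/7.3mb]"
)

# Heap usage reported by the JVM at the time of the check.
real_usage = int(re.search(r"real usage: \[(\d+)/", message).group(1))

# Per-breaker byte counts from the "usages [...]" section.
usages_text = re.search(r"usages \[(.*)\]", message).group(1)
child_usages = {
    name: int(size) for name, size in re.findall(r"(\w+)=(\d+)/", usages_text)
}

tracked = sum(child_usages.values())
print(child_usages)
print(f"tracked by child breakers: {tracked / 1024 ** 2:.1f} MiB")
print(f"share of real usage: {tracked / real_usage:.4%}")
```

In this message the child breakers account for only a few MiB, while real usage is above 46 GB. Almost all of the heap counted against the limit is therefore memory the breakers do not track themselves, which may well include garbage that has not been collected yet.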

elasticsearch.yml

cluster.name: dba
discovery.seed_hosts:
- es001.tab.com
- es002.tab.com
- es003.tab.com
network.bind_host: 0.0.0.0
network.host: 0.0.0.0
network.publish_host: es001.tab.com
node.name: es001.tab.com
path.data: "/var/lib/elasticsearch/data/dba"
path.logs: "/var/log/elasticsearch/dba"
xpack.ml.enabled: false
xpack.security.enabled: false
xpack.watcher.enabled: false

jvm.options:

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseCMSInitiatingOccupancyOnly
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=75
-Xloggc:/var/log/elasticsearch/dba/gc.log
-Xms50g
-Xmx50g
-Xss1m
-server
-verbose:gc

We are unable to test:

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

because these options are not supported by Zing; the Zing JDK does not use G1GC.

Our Zing conf:

pmem enabled
fundmemory 64G 64G
fund Grant 4G 4G
fund PausePrevention 4G 4G
nodemask	0xFFFFFFFF

Just to update: we solved the issue with the Azul Zing JDK by setting -XX:GPGCTargetPeakHeapOccupancyPercent=95.
Our jvm.conf right now:

-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:GPGCTargetPeakHeapOccupancyPercent=95
-Xloggc:/var/log/elasticsearch/dba/gc.log
-Xms32g
-Xmx32g
-Xss1m
-server
-verbose:gc
