Circuit_breaking_exception during reindex

We've started getting frequent circuit breaker exceptions in our cluster, and I need some advice on how to tackle our situation. I've read up on circuit breakers and read lots of posts on the topic, but I'm still not sure what we need to do.

We're running a 9-node cluster on 7.3.1, with each node having 64 GB of OS-level RAM and a 30 GB heap. All nodes run as Docker containers with no custom GC settings.

Here's an example error message from a basic reindex operation:

elasticsearch.exceptions.TransportError: TransportError(429, '{"took":1025273,"timed_out":false,"total":482123,"updated":469000,"created":0,"deleted":0,"batches":469,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"shard":-1,"reason":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<transport_request>] would be [30609748696/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30609747960/28.5gb], new bytes reserved: [736/736b], usages [request=0/0b, fielddata=35860/35kb, in_flight_requests=736/736b, accounting=524016769/499.7mb]","bytes_wanted":30609748696,"bytes_limit":30601641984,"durability":"PERMANENT"}}]}')
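For what it's worth, the bytes_limit of 30601641984 is exactly 95% of our 30 GB heap, which I believe is the default parent breaker limit when the real-memory circuit breaker is enabled, and real usage is sitting right at that line. The reindex itself is nothing special; a minimal sketch of the kind of call that raises the 429 above (the host and index names are placeholders, not our real ones):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node-001:9200"])  # placeholder host

# A plain server-side reindex; the TransportError(429, ...) above is raised
# from a call like this when the parent breaker trips on one of the nodes.
es.reindex(
    body={
        "source": {"index": "docs-old"},  # placeholder index names
        "dest": {"index": "docs-new"},
    },
    wait_for_completion=True,
    request_timeout=3600,  # reindexing ~480k docs takes a while
)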

I've started monitoring the breakers and our problem seems to be with the parent breaker. Here's an example where one of our nodes hits the limit for the parent breaker:

Some questions:

  1. Would our cluster benefit from reducing the max heap size on each node? All nodes use zero-based compressed oops (see the sketch after this list for how we check that).

  2. Should we add more nodes to the cluster?

  3. Should we use a more aggressive GC?
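(For question 1, this is roughly how we check the per-node heap size and compressed-oops status from the client. The host is a placeholder, and whether the oops are actually zero-based only shows up in the JVM log, e.g. when the node is started with -Xlog:gc+heap+coops.)

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node-001:9200"])  # placeholder host

# Report heap size and compressed-oops status for every node. This only shows
# whether compressed oops are in use; the zero-based detail comes from the
# JVM log (-Xlog:gc+heap+coops on JDK 9+).
info = es.nodes.info(metric="jvm")
for node in info["nodes"].values():
    jvm = node["jvm"]
    heap_gib = jvm["mem"]["heap_max_in_bytes"] / 1024 ** 3
    print(f'{node["name"]}: heap {heap_gib:.1f} GiB, '
          f'compressed oops: {jvm["using_compressed_ordinary_object_pointers"]}')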

Any tips on how to fix this would be very helpful.

Hi @tomhe,

you did state that you use "no custom GC settings", but I still need to ask whether you use G1 GC, since if so, there are adjustments to the GC settings that can help.

Also, can you, for completeness, include the full exception trace logged for the circuit breaker? And if you have GC logs enabled, seeing those could also be interesting.

I still need to ask whether you use G1 GC, since if so, there are adjustments to the GC settings that can help.

At the time of my post I was using a jvm.options file that only specified the heap size and nothing else, and I'm not sure which GC is used when none is specified. I have since started using the jvm.options file that is shipped with the Docker image, so now I run G1 GC like this:

## GC configuration
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30

Full jvm.options:

Also, can you, for completeness, include the full exception trace logged for the circuit breaker?

I don't have the full exception for exactly the event in my original post, but here is a recent one:

And if you have GC logs enabled, seeing those could also be interesting.

I don't see anything special in the GC logs around the timestamp in the above exception. It looks like this:

We're still seeing frequent breaker exceptions, which is causing quite a bit of trouble for us. Even the Kibana UI fails from time to time because it can't access the .kibana index due to circuit breaking exceptions.

We have also started getting unassigned shards due to circuit breaker exceptions during rerouting:

"unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-09-18T14:02:55.475Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [fzlHKosuRseXxSPur04-Sg]: failed recovery, failure RecoveryFailedException[[doors-daily-004][5]: Recovery failed from {node-007}{pZnexMJIQF-Heq9jTjwi-g}{xh5vQb-rQ72ExvbFfzqRbQ}{10.33.9.75}{10.33.9.75:9300}{di}{ml.machine_memory=67388325888, ml.max_open_jobs=20, xpack.installed=true} into {node-003}{fzlHKosuRseXxSPur04-Sg}{9LQEzjF1QW6_gkKfM7EUog}{10.33.9.92}{10.33.9.92:9300}{dim}{ml.machine_memory=68311584768, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[node-007][172.25.0.2:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [32015385438/29.8gb], which is larger than the limit of [30601641984/28.5gb], real usage: [32015384064/29.8gb], new bytes reserved: [1374/1.3kb], usages [request=0/0b, fielddata=98253/95.9kb, in_flight_requests=666487240/635.6mb, accounting=574699439/548mb]]; ",
    "last_allocation_status" : "no_attempt"
  }
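If I understand the allocation retry limit correctly, these shards won't be retried automatically once failed_allocation_attempts reaches 5, so after memory pressure drops we have to kick allocation explicitly, along these lines (host is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node-001:9200"])  # placeholder host

# Ask the master to retry allocation of shards whose previous attempts failed;
# without this they stay unassigned even after the breaker pressure is gone.
es.cluster.reroute(retry_failed=True)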

Hi @tomhe,

which Java version are you using? It looks like it could be a Java 8 version (or at least that might explain the symptoms).

I only mentioned G1 because some users have reported issues using G1 and we are working on finding the necessary configuration to run G1 together with the real-memory circuit breaker. Using CMS (the default for Elasticsearch) is still the easiest/safest way to stay on the well-trodden path.

We're using the bundled version in the Docker images published by Elastic:

sh-4.2# /usr/share/elasticsearch/jdk/bin/java --version
openjdk 12.0.2 2019-07-16
OpenJDK Runtime Environment (build 12.0.2+10)
OpenJDK 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

Would you recommend that we switch to CMS?

Hi @tomhe,

Maybe you can share the full GC log file so I can take a look first? Feel free to PM me with the file if you prefer that.

Thanks, I'll send you the log from one of our nodes. We're currently logging like this:

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT

Is that sufficient?

Hi @tomhe,

it would be good to see one of the GC log files from a node, covering a time when a circuit breaker trip occurred.

The above is fine, though if you have the chance, adding gc+ihop=trace makes sense, that is:

9-:-Xlog:gc*,gc+age=trace,gc+ihop=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

Ok, good. Will update the logging configuration on our nodes and send the log to you in a PM.

Any news about this? I have the same issue.

No, this is still causing problems for us. I am considering switching to CMS.

A user has reported that increasing G1ReservePercent to 35 helped their situation. It could be worth trying out. The main downside is that it will trigger GC a bit more aggressively and thus use more CPU.

Also, if you have log information on the circuit breaker incidents as well as GC logs overlapping that time, I would be interested in seeing them. @tomhe, you already sent logs for prior incidents, but getting a new set of log files from all nodes during an incident would be good. Feel free to PM log files directly to me.

Thanks, I will start by trying with G1ReservePercent at 35 to see how that works.
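If I read the jvm.options syntax right, that should just mean changing the reserve line in the block I posted earlier to:

10-:-XX:G1ReservePercent=35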

I'd like to track the frequency of circuit breaker exceptions across our cluster. What would be the easiest way to do that?

I think the easiest route is to feed the logs (preferably the JSON variant) into a separate ES cluster for further analysis. This could likely be just a single-node cluster. I think this description could work out.

After that is done, you should be able to find a suitable pattern to search for (something like message: CircuitBreakingException is a starting point).
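As a rough sketch of the kind of query I have in mind once the logs are indexed (the index name and message field are placeholders for whatever your ingest setup produces):

from elasticsearch import Elasticsearch

log_es = Elasticsearch(["http://log-cluster:9200"])  # placeholder monitoring cluster

# Count circuit breaker exceptions per day; "es-logs-*" and "message" are
# placeholders for whatever the log ingest actually produces.
resp = log_es.search(
    index="es-logs-*",
    body={
        "size": 0,
        "query": {"match_phrase": {"message": "CircuitBreakingException"}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "calendar_interval": "day"}
            }
        },
    },
)
for bucket in resp["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])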

Thanks. I ended up pulling data from the /_nodes/stats/breaker endpoint and watching the "tripped" counter. It looks to be working, but I don't understand the value of that counter. My assumption was that it would increment by one for each exception, but when an exception occurs I've seen it increase by anywhere from 7 to 1451.
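For reference, the polling is roughly this (host and interval are placeholders):

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node-001:9200"])  # placeholder host

# Watch the parent breaker's "tripped" counter on every node and report deltas.
previous = {}
while True:
    stats = es.nodes.stats(metric="breaker")
    for node_id, node in stats["nodes"].items():
        tripped = node["breakers"]["parent"]["tripped"]
        delta = tripped - previous.get(node_id, tripped)
        if delta > 0:
            print(f'{node["name"]}: parent breaker tripped {delta} times since last poll')
        previous[node_id] = tripped
    time.sleep(60)  # poll once a minute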

Also, we've added three more nodes to the cluster, from 9 to 12, and now we're only seeing a handful of breaker exceptions per day.

The tripped counter for the "parent" breaker should only increase once per memory request that is rejected. But it is likely that not all such rejections are reported in log files (many are just reported back to the client). We improved the logging on this, but unfortunately that is not yet out in a release.

Such rejections are likely to occur in bursts until GC has freed enough memory to get back under the circuit breaker limit.

Increasing cluster capacity will reduce the risk of seeing these.

If you have ES logs with CB exceptions and GC logs covering the timestamp (preferably the whole file, but at least some minutes before and after too), I would be interested in seeing those.

Will send the logs in a PM.