CircuitBreakingException on my ES cluster

Hello all,

I keep encountering a CircuitBreakingException on my ES cluster.

The stack trace below suggests that the CircuitBreakingException occurs during a seemingly lightweight call (cluster:monitor/stats), but I have seen similar failures during indices:data/write/bulk calls as well.

"org.elasticsearch.transport.RemoteTransportException: [elasticsearch-data01-srv][10.210.146.44:9300][cluster:monitor/stats[n]]"
"Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [14592176820/13.5gb], which is larger than the limit of [14280766259/13.2gb], real usage: [14592173056/13.5gb], new bytes reserved: [3764/3.6kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=3764/3.6kb, accounting=384737042/366.9mb]",
"at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.1.jar:7.3.1]"

I am running ES 7.3 using the bundled JDK. My cluster setup is as follows:

  • 3 dedicated master nodes (t2.xlarge EC2 instances with 16GB RAM; the only non-default jvm.options settings on these nodes are -Xms7g and -Xmx7g)
  • 3 data-only nodes (r4.xlarge EC2 instances with 30.5GB RAM; the only non-default jvm.options settings on these nodes are -Xms14g and -Xmx14g)

Other relevant modifications to my elasticsearch.yml file:

  • bootstrap.memory_lock: true
  • http.max_content_length: 1g
  • indices.memory.index_buffer_size: 30%
  • thread_pool.write.queue_size: -1

Other stats:

  • Total number of indices: 174
  • Number of replicas: 1
  • Total shards: 1676 (90% of the indices are configured with 5 primary shards each; the largest shard is 814.6mb)
  • Total docs count: 20329064
  • Store size: 88GB
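
(In case it is useful, figures like these can be pulled from the cat APIs, e.g.:

    GET _cat/indices?v
    GET _cat/shards?v&s=store:desc

The second call sorts shards by on-disk size, which is how the largest shard stands out.)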

I would really appreciate any insights into what the problem and solution might be, as well as pointers on how to approach such issues.

Thanks!

You have way too many shards.

Total shards: 1676 (90% of the indices are configured with 5 primary shards each; the largest shard is 814.6mb)

You should re-index every index that has 5 primary shards down to a single primary shard. That will make the cluster much more stable.
If you do that and the circuit-breaker issue persists, further possible causes can be explored, but the re-sharding needs to happen anyway.
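
One way to do that, assuming an example index named logs-old (index names and settings here are placeholders you would adapt to your own):

    PUT logs-new
    {
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
      }
    }

    POST _reindex
    {
      "source": { "index": "logs-old" },
      "dest":   { "index": "logs-new" }
    }

Once the new index is verified you can delete the old one (and, if needed, add an alias so clients keep using the old name).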

These settings seem unusual and potentially dangerous. Can you please describe the use case and why you have added these custom settings?

I also agree with @Glen_Smith that the shard count seems excessive given the data volume on disk. You should look to reduce it significantly, e.g. by using the shrink index API to go from 5 primary shards to 1.
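
A minimal sketch of the shrink route, with placeholder index names (the node name is just an example; pick any data node with enough disk). The source index first needs to be made read-only with a copy of every shard on one node:

    PUT logs-old/_settings
    {
      "settings": {
        "index.blocks.write": true,
        "index.routing.allocation.require._name": "elasticsearch-data01-srv"
      }
    }

    POST logs-old/_shrink/logs-shrunk
    {
      "settings": {
        "index.number_of_shards": 1,
        "index.routing.allocation.require._name": null,
        "index.blocks.write": null
      }
    }

Shrinking requires the target shard count to be a factor of the source's, so 5 -> 1 works.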


Thank you for the hint about the shards, @Glen_Smith. I will test this out and report back shortly.

To your point, @Christian_Dahlqvist,
http.max_content_length: 1g: Some of our ES HTTP requests had payloads that exceeded the 100mb default. The current value of 1g is excessive; we need to evaluate the right value for our needs and set it accordingly.

indices.memory.index_buffer_size: 30%: We refresh 90% of our ES documents (delete the index, recreate it, and re-insert the documents) every 24 hours as part of our nightly batch processes, and these documents have a large number of fields (approx. 1100). Setting the index_buffer_size to 30% on our old ES 5.x cluster had helped us complete the batch runs within a limited time window, and we did not change the setting when we upgraded to ES 7.x. We need to evaluate whether 30% still gives us the same benefit on 7.x or whether the default will suffice. (A rough sketch of the nightly cycle is at the end of this post.)

thread_pool.write.queue_size: -1: This is the default value in any case; I will remove it from elasticsearch.yml.
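
For context, the nightly cycle is essentially the following (index name, mapping and documents are placeholder examples, not our real ones):

    DELETE nightly-index

    PUT nightly-index
    {
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 },
      "mappings": { "properties": { "field1": { "type": "keyword" } } }
    }

    POST nightly-index/_bulk
    { "index": {} }
    { "field1": "value1" }
    { "index": {} }
    { "field1": "value2" }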


After reducing the number of shards to one per index, the circuit breaker issue indeed went away. Thanks for helping out!

Would it be possible to provide some insight into how we could have debugged this ourselves from the CircuitBreakingException? The exception only said that the data was larger than 13GB; what we could not decipher was how a call to monitor/stats could result in so much data!
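
One detail that stands out when re-reading the exception: the 13.2gb limit is 95% of the 14g heap on the data nodes, which matches the default parent breaker limit in 7.x when it tracks real memory usage. So the message mostly indicates that the heap was already nearly full, rather than the stats call itself being heavy (it only reserved 3.6kb). Per-node heap pressure can be watched with something like:

    GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max

alongside the breaker stats from GET _nodes/stats/breaker mentioned above.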

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.