CircuitBreakingException on my ES cluster

SJH · October 24, 2019, 3:45pm

Hello all,

I keep encountering a CircuitBreakingException on my ES cluster.

The following stack trace suggests that the CircuitBreakingException occurs during a seemingly lightweight call (cluster:monitor/stats), but I have seen similar issues during indices:data/write/bulk calls as well.

"org.elasticsearch.transport.RemoteTransportException: [elasticsearch-data01-srv][10.210.146.44:9300][cluster:monitor/stats[n]]"
"Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [14592176820/13.5gb], which is larger than the limit of [14280766259/13.2gb], real usage: [14592173056/13.5gb], new bytes reserved: [3764/3.6kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=3764/3.6kb, accounting=384737042/366.9mb]",
"at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.1.jar:7.3.1]"

I am running ES 7.3 using the bundled JDK. My cluster setup is as follows:

3 dedicated master nodes (t2.xlarge EC2 instances with 16GB RAM. The only non-default jvm.options on these nodes is -Xms7g -Xmx7g)
3 data-only nodes (r4.xlarge EC2 instances with 30.5GB RAM. The only non-default jvm.options on these nodes is -Xms14g -Xmx14g)

Other relevant modifications to my elasticsearch.yml file:

bootstrap.memory_lock: true
http.max_content_length: 1g
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: -1

Other stats:

Total number of indices: 174
Number of replicas: 1
Total shards: 1676 (90% indices are configured to have 5 shards per index. Largest shard is 814.6mb)
Total docs count: 20329064
Store size in bytes: 88GB

I would really appreciate any insights into what the problem and solution could be, as well as pointers on how to approach such issues.

Thanks!

Glen_Smith · October 25, 2019, 1:59am

You have way too many shards.

Total shards: 1676 (90% indices are configured to have 5 shards per index. Largest shard is 814.6mb)

You should re-index every index that has 5 primary shards down to a single primary shard. That will make the cluster much more stable.
If you do that and the circuit-breaker issue persists, further possible causes can be explored, but the re-sharding needs to happen anyway.
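To put numbers on "way too many", here is a rough back-of-the-envelope check in Python using the figures from the original post. It assumes the total of 1676 counts primaries and replicas, that shards are spread evenly over the 3 data nodes, and that the commonly cited rule of thumb of roughly 20 shards per GB of heap applies. The constant name is made up for this sketch, and the rule is a guideline, not a hard limit.

```python
# Figures taken from the original post.
total_shards = 1676           # assumed to include replicas
data_nodes = 3
heap_gb_per_data_node = 14    # -Xms14g -Xmx14g
store_size_gb = 88

# Assumption: rule of thumb of at most ~20 shards per GB of heap per node.
MAX_SHARDS_PER_GB_HEAP = 20

shards_per_node = total_shards / data_nodes
recommended_max = heap_gb_per_data_node * MAX_SHARDS_PER_GB_HEAP
avg_shard_mb = store_size_gb * 1024 / total_shards

print(f"shards per data node: {shards_per_node:.0f}")
print(f"rule-of-thumb ceiling per node: {recommended_max}")
print(f"average shard size: {avg_shard_mb:.1f} MB")
```

Under these assumptions each data node carries about twice the suggested ceiling, while the average shard holds only tens of megabytes, so most of the per-shard overhead buys nothing.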

Christian_Dahlqvist · October 25, 2019, 5:48am

These settings seem unusual and potentially dangerous. Can you please describe the use case and why you have added these custom settings?

I also agree with @Glen_Smith that the shard count seems excessive given that data volume on disk. You should look to reduce this significantly, e.g. by using the shrink index API to go from 5 to 1 primary shards.
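For reference, a minimal sketch of the two requests a 5-to-1 shrink usually involves, written as a Python helper that only builds the request bodies and sends nothing. The helper name, the index names, and the choice of node are hypothetical; the settings keys follow the documented shrink procedure (move a copy of every shard to one node and block writes, then shrink and clear those two settings on the target). Please check them against the docs for your exact version before running anything.

```python
import json


def shrink_requests(source, target, node_name, replicas=1):
    """Return (method, path, body) tuples for shrinking `source` into `target`.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    prepare = (
        "PUT",
        f"/{source}/_settings",
        {
            "settings": {
                # Relocate one copy of every shard onto a single node.
                "index.routing.allocation.require._name": node_name,
                # Block writes while the index is being shrunk.
                "index.blocks.write": True,
            }
        },
    )
    shrink = (
        "POST",
        f"/{source}/_shrink/{target}",
        {
            "settings": {
                "index.number_of_shards": 1,
                "index.number_of_replicas": replicas,
                # Clear the temporary settings copied from the source index.
                "index.routing.allocation.require._name": None,
                "index.blocks.write": None,
            }
        },
    )
    return [prepare, shrink]


# Example with made-up index names and a node name from the stack trace.
for method, path, body in shrink_requests(
    "my-index", "my-index-shrunk", "elasticsearch-data01-srv"
):
    print(method, path)
    print(json.dumps(body, indent=2))
```

For indices that are deleted and recreated on a schedule anyway, creating them with a single primary shard in the first place avoids the shrink step entirely.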

SJH · October 25, 2019, 11:56am

Thank you for the hint about the shards, @Glen_Smith. I will test this out and report back shortly.

To your point, @Christian_Dahlqvist,
http.max_content_length: 1g: Some of our ES HTTP requests had payloads that exceeded the 100mb default. The current value of 1g is excessive; we need to evaluate the right value for our needs and set the configuration accordingly.

indices.memory.index_buffer_size: 30%: We refresh 90% of our ES documents (delete index, recreate index and re-insert documents) every 24 hours as part of our nightly batch processes. Moreover, these documents have a large number of fields (approx. 1100). Setting index_buffer_size to 30% in our old ES 5.x cluster helped us complete our batch processes within a limited time window, and we did not change the setting when we upgraded to ES 7.x. We need to evaluate whether 30% still gives us the same benefit in ES 7.x or whether the ES 7.x default will suffice.

thread_pool.write.queue_size: -1: This is the default value in any case. Will remove it from elasticsearch.yml

SJH · November 4, 2019, 1:31pm

After reducing the number of shards to one per index, the circuit breaker issue indeed went away. Thanks for helping out!

Would it be possible to provide some insights as to how we could have debugged this ourselves by looking at the CircuitBreakingException? The exception only said that the data was larger than 13GB. What we could not decipher was how the call to monitor/stats was resulting in so much data!
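One way to read the message, sketched in Python: since 7.0 the parent breaker by default compares the real heap usage of the node against a limit of 95% of the heap, so the numbers describe the whole JVM heap rather than the request that happened to trip the check. The parser below is a hypothetical helper written for this exact message format; it is not part of Elasticsearch.

```python
import re

# Message copied from the stack trace in the first post.
MESSAGE = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[14592176820/13.5gb], which is larger than the limit of "
    "[14280766259/13.2gb], real usage: [14592173056/13.5gb], "
    "new bytes reserved: [3764/3.6kb], usages [request=0/0b, fielddata=0/0b, "
    "in_flight_requests=3764/3.6kb, accounting=384737042/366.9mb]"
)


def parse_breaker_message(msg):
    """Extract the byte counts from a parent circuit breaker message."""
    def grab(pattern):
        return int(re.search(pattern, msg).group(1))

    usages_text = re.search(r"usages \[(.*)\]", msg).group(1)
    usages = {
        name: int(num)
        for name, num in re.findall(r"(\w+)=(\d+)/", usages_text)
    }
    return {
        "would_be": grab(r"would be \[(\d+)/"),
        "limit": grab(r"limit of \[(\d+)/"),
        "real_usage": grab(r"real usage: \[(\d+)/"),
        "new_bytes": grab(r"new bytes reserved: \[(\d+)/"),
        "usages": usages,
    }


info = parse_breaker_message(MESSAGE)
heap_bytes = 14 * 1024**3  # -Xmx14g on the data nodes

print("limit as a fraction of heap:", round(info["limit"] / heap_bytes, 2))
print("heap already in use before this request:", info["real_usage"])
print("bytes this request wanted to reserve:", info["new_bytes"])
print("bytes tracked by child breakers:", sum(info["usages"].values()))
```

Read this way, the stats call asked for only 3764 bytes. The heap was already above the limit before the request arrived, and the child breakers account for only about 367 MB of it, so the rest is heap the breakers do not track individually. That would explain why any request, light or heavy, could trip the breaker, and it is consistent with the fix that worked here. Comparing `GET _nodes/stats/breaker` with `GET _nodes/stats/jvm` on an affected node is one way to see the same picture live.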

system · December 2, 2019, 1:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
