Hi,
I am trying to size a production cluster based on a dev setup where I encountered `Data too large .. [transport_request]` (a `CircuitBreakingException` on the `[parent]` circuit breaker). Since retries are also exhausted, some shards get stuck in the `UNASSIGNED` state and the cluster state becomes RED.
I tried recovering the cluster using `reroute`, but it leaves the cluster in YELLOW state, again hitting the `CircuitBreakingException`.
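For reference, the recovery attempt was along these lines (a sketch - `localhost:9200` is a placeholder for one of my nodes; as I understand it, the `?retry_failed=true` flag is needed once allocation retries are exhausted, since a plain reroute does not retry those shards):

```shell
# Ask the master to retry allocating shards whose allocation
# retries were exhausted (a plain reroute skips these).
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"

# Explain why a shard is still UNASSIGNED.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```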
I checked a few discussion threads on this topic, and most of the recommendations were around increasing the heap size per instance and switching to G1GC.
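For context, the G1GC recommendation from those threads boils down to editing `config/jvm.options`. A sketch of the flags I saw suggested (I have not verified these values myself; they replace the default CMS flags):

```
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```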
My cluster config is as follows:

- Environment: Kubernetes
- Elasticsearch version: 7.0.1
- Cluster setup: 3 physical nodes with 4 ES instances per node (each instance running as a container in a dedicated Kubernetes Pod)
- Total data nodes: 12
- Heap allocated per ES instance container: 14GB
- Total memory allocated per ES instance (Docker container): 28GB
- Inter-node link bandwidth across physical nodes: 10Gbps
- Largest shard size per instance: 11GB
- Next-largest shard size per instance: 8GB
Test Scenario:
For testing purposes, I suppressed both indexing and query traffic, left all other configuration intact, and bounced just one of the 12 ES instances.
Issue Reproduction:
To reproduce the issue at small scale, I created a small cluster of 3 ES instances running in Docker containers, created one index with 5 primary shards and 5 replica shards, and loaded it with dummy data up to a shard size of ~95MB. The heap allocated per instance was 512MB.
Observations during reproduction of the issue:
Even in this small reproduction setup, I saw the `[parent]` breaker trip at least 2-3 times (and of course it recovered) during repeated testing. It was not as bad as on the scaled cluster, but the same exception could still be seen with repeated bouncing of ES instances.
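For anyone repeating this, I watched the breaker state via the node stats API (a sketch; `localhost:9200` is a placeholder):

```shell
# Per-node breaker stats: configured limit, estimated memory use,
# and the "tripped" counter, which increments on each rejection.
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"
```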
My clarifications regarding the reproduction:

- The on-disk shard size is 95MB. In the worst case, even if 2 shards get allocated in peer-recovery mode concurrently (`indices.recovery.max_concurrent_file_chunks`), the heap-memory requirement would be at most ~200MB total - please correct me if I am wrong. What are the other heap requirements during shard allocation (note that no indexing or query load was running while shard allocation was in progress)?
- Would reducing `indices.recovery.max_concurrent_file_chunks` from its default value of 2 to 1 reduce the demand on heap?

NOTE: I understand that reducing `indices.recovery.max_concurrent_file_chunks` can slow down recovery. But I want to be clear whether concurrent recovery of more than one shard is what is putting demand on the heap and causing the `CircuitBreakingException`.
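To make the second question concrete, the change I have in mind is the following (a sketch; note my assumption about which knob does what - as far as I understand, `indices.recovery.max_concurrent_file_chunks` limits file chunks in flight within a single shard recovery, while `cluster.routing.allocation.node_concurrent_recoveries` limits how many shard recoveries run concurrently on a node):

```shell
# Throttle recovery concurrency cluster-wide (transient, so it
# resets on full cluster restart).
curl -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.recovery.max_concurrent_file_chunks": 1,
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}'
```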
Thanks in advance for any suggestions and advice.
- Dinesh