I am trying to size a production cluster based on a dev setup where I encountered Data too large .. [transport_request] exceptions from the [parent] circuit breaker. Since the allocation retries also get exhausted, some shards get stuck in UNASSIGNED state and the cluster state becomes YELLOW.
I tried recovering the cluster using _cluster/reroute, but it leaves the cluster in YELLOW state, again hitting the same circuit-breaker exception.
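For reference, this is roughly what I ran to retry the failed allocations and inspect the stuck shards (host/port are placeholders for one of the instances):

```
# Retry allocation of shards whose automatic allocation retries were exhausted
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"

# Ask the cluster why a shard is still UNASSIGNED
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```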
I checked a few discussion threads on this topic, and most of the recommendations were around increasing the heap size per instance and switching to G1GC.
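In jvm.options terms I understand that recommendation to be roughly the following (a sketch only; 7.0.1 ships with CMS by default, and the path below is the one used in the official docker image):

```
# Check the current heap and GC settings of an instance
grep -E '^-Xm|GC' /usr/share/elasticsearch/config/jvm.options
# Direction suggested in the threads (not changes I have made yet):
#   -Xms14g / -Xmx14g              explicit, equal min and max heap
#   replace -XX:+UseConcMarkSweepGC with -XX:+UseG1GC
```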
My cluster config is as follows:
Elasticsearch version : 7.0.1
Cluster setup : 3 physical nodes with 4 instances of ES per node (each instance running as a container within a dedicated Kubernetes Pod)
Total data nodes : 12
Heap allocated per ES instance container : 14GB
Total memory allocated per ES instance (docker container) : 28GB (a docker-style sketch of one instance follows this list)
Inter-nodal link bandwidth across physical nodes : 10Gbps
Largest shard size per instance : 11GB
Next largest shard size per instance : 8GB
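For concreteness, each data-node instance is started roughly like this (a docker-style sketch; in reality each is a Kubernetes Pod, and discovery/network settings are omitted):

```
# One of the 4 ES instances on a physical node
docker run -d --name es-data-01 \
  --memory 28g \
  -e ES_JAVA_OPTS="-Xms14g -Xmx14g" \
  -e node.master=false -e node.data=true \
  docker.elastic.co/elasticsearch/elasticsearch:7.0.1
```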
For testing purposes, I suppressed both indexing and query traffic, left all other configuration intact, and bounced just one of the 12 ES instances.
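After the bounce I watch shard allocation and recovery with something along these lines:

```
# Which shards are UNASSIGNED, and why
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"

# Progress of ongoing peer recoveries
curl -s "localhost:9200/_cat/recovery?v&active_only=true"
```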
As a small-scale reproduction, I created a small cluster of 3 ES instances running in docker containers, created one index with 5 primary shards and 1 replica of each (5 primaries + 5 replicas), and loaded it with dummy data up to a shard size of about 95MB. The heap allocated per instance was 512MB.
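Roughly, the reproduction setup looked like this (index name and data are placeholders; discovery settings omitted):

```
# One of the 3 small instances (512MB heap)
docker run -d --name es-repro-01 \
  -e ES_JAVA_OPTS="-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:7.0.1

# Test index: 5 primaries, 1 replica of each (10 shard copies in total)
curl -X PUT "localhost:9200/repro-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'
```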
Observations during reproduction of the issue
Even in this small reproduction setup, I was able to see the circuit breaker tripping at least 2-3 times (and of course it recovered) in repeated testing. It was not as bad as in the scaled cluster, but the same exception could be seen with repeated bouncing of ES instances.
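I checked the breaker trip counts through the node stats, with something like:

```
# Per-node circuit-breaker stats: limits, estimated sizes, and "tripped" counters
curl -s "localhost:9200/_nodes/stats/breaker?pretty"
```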
My clarifications regarding the reproduction:
The on-disk shard size is 95MB. In the worst case, even if 2 shards get allocated via peer recovery concurrently (indices.recovery.max_concurrent_file_chunks), the heap-memory requirement would be at most ~200MB in total - please correct me if I am wrong. What are the other heap requirements during shard allocation (note that no indexing or query load was running while shard allocation was in progress)?
Would reducing indices.recovery.max_concurrent_file_chunks from its default value of 2 down to 1 reduce the demand on heap? (The change I have in mind is sketched after the note below.)
NOTE: I understand that reducing indices.recovery.max_concurrent_file_chunks can slow down recovery. But I wanted to be clear whether concurrent recovery of more than one shard is what puts demand on the heap and causes the circuit breaker to trip.
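For reference, this is the settings change I have in mind (applied as a transient cluster setting):

```
# Reduce concurrent file chunks per recovery from the default of 2 to 1
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": { "indices.recovery.max_concurrent_file_chunks": 1 }
}'
```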
Thanks in advance for any suggestions and advice.