We have an 8-node cluster and our load (mainly bulk ingest) is fairly high. Previously the same load was handled well by 6 nodes on ES 6.8. Now, after moving to 7.8, we see many replica shards become unassigned under load.
The allocation explain API gives the reason as:
"details" : "failed shard on node [zC2EkvPLQiWpJ_YjnllD5w]: failed to perform indices:data/write/bulk[s] on replica [10fc5a76ee7042b3ad5bf620ac9fdb39-psrtenant15-fa-cse-asset][0], node[zC2EkvPLQiWpJ_YjnllD5w], [R], s[STARTED], a[id=6xKPtXO5TeyjZL12zRA7rA], failure RemoteTransportException[[psrnativefa112521-esdata4][100.104.145.203:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [indices:data/write/bulk[s][r]] would be [31182253448/29gb], which is larger than the limit of [30601641984/28.5gb], real usage: [31181936024/29gb], new bytes reserved: [317424/309.9kb], usages [request=256/256b, fielddata=64205239/61.2mb, in_flight_requests=60178048/57.3mb, accounting=1148757896/1gb]]; ",`
Issue: the parent circuit breaker is hitting its limit of 28.5GB (95% of heap), and our heap is 30GB.
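To put the numbers from the exception side by side, here is a small Python sketch. The values are copied from the message above; the variable names are only illustrative, and it assumes the 30GB heap is exactly 30 GiB. It checks that "would be" is real usage plus the newly reserved bytes, that the limit works out to 95% of heap, and how much of the real usage is actually tracked by the child breakers.

```python
# Values copied from the CircuitBreakingException above (all in bytes).
GIB = 1024 ** 3
heap_bytes = 30 * GIB              # assumption: heap is exactly 30 GiB
limit_bytes = 30601641984          # parent breaker limit (28.5gb)
real_usage_bytes = 31181936024     # "real usage" reported by the breaker
new_reserved_bytes = 317424        # "new bytes reserved" for this bulk request
child_usages = {                   # the "usages [...]" part of the message
    "request": 256,
    "fielddata": 64205239,
    "in_flight_requests": 60178048,
    "accounting": 1148757896,
}

# The breaker compares (real usage + new reservation) against the limit.
would_be_bytes = real_usage_bytes + new_reserved_bytes
limit_fraction = limit_bytes / heap_bytes

# Memory the child breakers account for, versus everything else on the heap.
tracked_bytes = sum(child_usages.values())
untracked_bytes = real_usage_bytes - tracked_bytes

print(f"limit / heap       : {limit_fraction:.2%}")
print(f"would be           : {would_be_bytes} bytes")
print(f"over the limit by  : {(would_be_bytes - limit_bytes) / GIB:.2f} GiB")
print(f"tracked by breakers: {tracked_bytes / GIB:.2f} GiB")
print(f"not tracked        : {untracked_bytes / GIB:.2f} GiB")
```

The request that tripped the breaker reserved only about 310KB, and the child breakers together account for a little over 1 GiB. The rest of the 29GB is heap usage that none of the child breakers track.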
If we increase the parent breaker limit to 29.5GB, fewer shards become unassigned, but the issue persists.
Our JVM options already include the settings below, which should help in this case according to a few older discussions, but they are not helping much.
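For reference, the parent breaker limit is controlled by the dynamic cluster setting `indices.breaker.total.limit`. A minimal sketch of how such a change could be prepared from Python is below. The helper name `build_breaker_request`, the `http://localhost:9200` URL and the `98%` value (roughly 29.5GB of a 30GB heap) are assumptions for illustration only, not our actual setup.

```python
import json
import urllib.request


def build_breaker_request(base_url: str, limit: str) -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request.

    `limit` is passed through unchanged as the value of
    indices.breaker.total.limit, e.g. a percentage of heap such as "98%".
    """
    body = {"persistent": {"indices.breaker.total.limit": limit}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical endpoint and value; adjust for the real cluster.
req = build_breaker_request("http://localhost:9200", "98%")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be `urllib.request.urlopen(req)`. Nothing is sent in the sketch itself, so it can be inspected safely first.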
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
Please let us know what can be done to avoid this. We could disable this breaker, but it exists for a reason and we would rather not turn it off.