In ES 7.8, the parent circuit breaker trips frequently and leaves shards unassigned

We have an 8-node cluster under a fairly high load (mainly bulk ingest). The same load was previously handled well by 6 nodes on ES 6.8. Since moving to 7.8, many replica shards become unassigned under load.

The allocation explain API reports the reason as:

"details" : "failed shard on node [zC2EkvPLQiWpJ_YjnllD5w]: failed to perform indices:data/write/bulk[s] on replica [10fc5a76ee7042b3ad5bf620ac9fdb39-psrtenant15-fa-cse-asset][0], node[zC2EkvPLQiWpJ_YjnllD5w], [R], s[STARTED], a[id=6xKPtXO5TeyjZL12zRA7rA], failure RemoteTransportException[[psrnativefa112521-esdata4][100.104.145.203:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [indices:data/write/bulk[s][r]] would be [31182253448/29gb], which is larger than the limit of [30601641984/28.5gb], real usage: [31181936024/29gb], new bytes reserved: [317424/309.9kb], usages [request=256/256b, fielddata=64205239/61.2mb, in_flight_requests=60178048/57.3mb, accounting=1148757896/1gb]]; ",

Issue: the parent breaker is hitting its limit of 28.5 GB, while our heap is 30 GB.
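To make the numbers in that exception easier to reason about, here is a minimal Python sketch that pulls the byte counts out of the message and checks how they relate. The helper name `parse_breaker_message` is hypothetical, and `HEAP_BYTES` assumes the 30 GB heap is exactly 30 GiB (for example `-Xmx30g`); neither comes from Elasticsearch itself.

```python
import re

# CircuitBreakingException text copied from the "details" field of the allocation output.
MESSAGE = (
    "[parent] Data too large, data for [indices:data/write/bulk[s][r]] "
    "would be [31182253448/29gb], which is larger than the limit of "
    "[30601641984/28.5gb], real usage: [31181936024/29gb], "
    "new bytes reserved: [317424/309.9kb], usages [request=256/256b, "
    "fielddata=64205239/61.2mb, in_flight_requests=60178048/57.3mb, "
    "accounting=1148757896/1gb]"
)

GIB = 1024 ** 3
HEAP_BYTES = 30 * GIB  # assumption: the 30 GB heap is exactly 30 GiB


def parse_breaker_message(message):
    """Extract the byte counts from a parent-breaker exception message."""
    def grab(label):
        # Each value is printed as "[<bytes>/<human readable>]".
        return int(re.search(re.escape(label) + r":? \[(\d+)/", message).group(1))

    return {
        "would_be": grab("would be"),
        "limit": grab("limit of"),
        "real_usage": grab("real usage"),
        "new_bytes": grab("new bytes reserved"),
        # Child breakers are printed as "name=<bytes>/<human readable>".
        "children": {
            name: int(size) for name, size in re.findall(r"(\w+)=(\d+)/", message)
        },
    }


info = parse_breaker_message(MESSAGE)
tracked = sum(info["children"].values())

print("would_be == real_usage + new_bytes:",
      info["would_be"] == info["real_usage"] + info["new_bytes"])
print("limit as % of heap:", info["limit"] * 100 / HEAP_BYTES)
print("over limit by (MiB):", round((info["would_be"] - info["limit"]) / 1024 ** 2, 1))
print("tracked by child breakers (GiB):", round(tracked / GIB, 2))
```

Under these assumptions the "would be" figure is exactly real usage plus the new reservation, and the limit is exactly 95% of a 30 GiB heap, which matches the documented default for `indices.breaker.total.limit` when `indices.breaker.total.use_real_memory` is enabled. The child breakers together account for only about 1.2 GiB, so almost all of the 29 GB is heap that the individual breakers do not track.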

If we raise the parent breaker limit to 29.5 GB, fewer shards become unassigned, but the issue persists.

Our JVM options already include the settings below, which a few older discussions suggested should help in this case, but they are not helping much.

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

Please let us know what can be done to avoid this. We could disable this breaker, but it exists for a reason and we would rather not disable it.

Hi team, any update on this? A few questions:

  • How frequently is the parent breaker usage calculated?

  • When calculating parent breaker usage, do you just take the current heap usage, or the usage after GC? In my case, JVM heap usage goes above 29 GB several times, but GC brings it back down below 22 GB, so this should not count as a circuit break, right?
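The figures in the exception suggest one way to think about the second question. The sketch below is only a model, not Elasticsearch code: it assumes the parent breaker compares the heap in use at the moment of the request, plus the newly reserved bytes, against the limit, which is what "real usage" plus "new bytes reserved" equalling "would be" in the log implies. The function name `parent_would_trip` is hypothetical.

```python
GIB = 1024 ** 3


def parent_would_trip(heap_used, new_bytes, limit):
    """Hypothetical model: trip when current heap use plus the new
    reservation exceeds the parent breaker limit."""
    return heap_used + new_bytes > limit


NEW_BYTES = 317_424               # "new bytes reserved" from the log
DEFAULT_LIMIT = 30_601_641_984    # the 28.5 GB limit from the log
RAISED_LIMIT = int(29.5 * GIB)    # the 29.5 GB limit tried afterwards

heap_samples = {
    "peak before GC (from the log)": 31_181_936_024,
    "after GC (about 22 GiB)": 22 * GIB,
}

results = {}
for label, heap_used in heap_samples.items():
    for limit_name, limit in (("28.5 GiB", DEFAULT_LIMIT), ("29.5 GiB", RAISED_LIMIT)):
        trips = parent_would_trip(heap_used, NEW_BYTES, limit)
        results[(label, limit_name)] = trips
        print(f"{label:32s} limit {limit_name}: trips={trips}")
```

In this model a request that arrives while the heap sits at its pre-GC peak is rejected at the 28.5 GB limit even though the same request would pass moments later at 22 GB, and the logged peak would have fit under a 29.5 GB limit. That is consistent with seeing fewer, but not zero, failures after raising the limit.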

7.8 is really old, long past EOL, and newer versions are much more memory-efficient.

The first thing to try is to upgrade to a version that hasn't passed EOL.
