I am running nested aggregations over high-cardinality fields, so the large number of buckets is probably what is tripping the circuit breakers. I am trying to understand the following circuit breaker exceptions thoroughly.
```
shards info {'total': 1188, 'successful': 1145, 'skipped': 0, 'failed': 43, 'failures': [{'index': 'filebeat-zeek-000993', 'node': 'otGgKxl_TbK5ysQKln3uww', 'reason': {'reason': '[parent] Data too large, data for [<agg [total_duration]>] would be [30417917256/28.3gb], which is larger than the limit of [30411066572/28.3gb], real usage: [30417912136/28.3gb], new bytes reserved: [5120/5kb], usages [request=1023301984/975.8mb, fielddata=279576838/266.6mb, in_flight_requests=102276108/97.5mb, accounting=1380379181/1.2gb]', 'bytes_limit': 30411066572, 'bytes_wanted': 30417917256, 'type': 'circuit_breaking_exception', 'durability': 'PERMANENT'}, 'shard': 0}, ...
{'index': 'filebeat-zeek-000993', 'node': 'v3FFEHrGS2CDlq7ohvlG5w', 'reason': {'reason': '[parent] Data too large, data for [<agg [total_duration]>] would be [30414065128/28.3gb], which is larger than the limit of [30411066572/28.3gb], real usage: [30414060008/28.3gb], new bytes reserved: [5120/5kb], usages [request=1722145536/1.6gb, fielddata=276910893/264mb, in_flight_requests=101470356/96.7mb, accounting=1455114445/1.3gb]', 'bytes_limit': 30411066572, 'bytes_wanted': 30414065128, 'type': 'circuit_breaking_exception', 'durability': 'TRANSIENT'}, 'shard': 1}, ...
{'index': 'filebeat-zeek-000994', 'node': 'rZX2LRwkR72-AOWHeZoypw', 'reason': {'reason': '[parent] Data too large, data for [<reused_arrays>] would be [30417187920/28.3gb], which is larger than the limit of [30411066572/28.3gb], real usage: [30417187848/28.3gb], new bytes reserved: [72/72b], usages [request=1655563264/1.5gb, fielddata=311581278/297.1mb, in_flight_requests=44400570/42.3mb, accounting=1387271721/1.2gb]', 'bytes_limit': 30411066572, 'bytes_wanted': 30417187920, 'type': 'circuit_breaking_exception', 'durability': 'TRANSIENT'}, 'shard': 3} ...}
```
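For context, the failing request has roughly the shape below (a minimal sketch with the Python client; only the `total_duration` agg name and the index pattern come from the errors above, while the endpoint and field names are illustrative placeholders, not my real mapping):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

resp = es.search(
    index="filebeat-zeek-*",
    body={
        "size": 0,
        "aggs": {
            # Outer terms agg over a high-cardinality field (illustrative name):
            "by_source": {
                "terms": {"field": "source.ip", "size": 10000},
                "aggs": {
                    # Sub-agg named in the breaker errors above:
                    "total_duration": {"sum": {"field": "event.duration"}},
                },
            }
        },
    },
)
print(resp["_shards"])  # the failure dump above comes from here
```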
So there are three types of exceptions here: `[parent] Data too large, data for [<agg [total_duration]>]` with PERMANENT durability, the same `[parent] Data too large, data for [<agg [total_duration]>]` with TRANSIENT durability, and `[parent] Data too large, data for [<reused_arrays>]` with TRANSIENT durability.
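For reference, per-node breaker usage can be watched with the nodes stats API (`GET _nodes/stats/breaker`); a minimal sketch via the Python client, with a placeholder endpoint:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Per-node estimated usage vs. limit for each circuit breaker
stats = es.nodes.stats(metric="breaker")
for node_id, node in stats["nodes"].items():
    parent = node["breakers"]["parent"]
    print(node_id,
          parent["estimated_size"], "of", parent["limit_size"],
          "tripped:", parent["tripped"])
```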
My questions are:
- What does `[parent]` signify here? Does it have anything to do with the Parent Circuit Breaker?
- What does "Data too large" for `<reused_arrays>` mean in this context?
- What are the differences between the TRANSIENT and PERMANENT durabilities?
- Can partitions help lower the memory overhead of aggregations and prevent tripping the circuit breakers? (See the sketch after this list.)
- Can more hardware (more nodes) and/or splitting the index into more shards help? How can I estimate how much hardware would be optimal for a given aggregation query?
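Regarding the partitions question above: what I have in mind is the terms aggregation's `include` partitioning, which splits the key space into `num_partitions` slices so each request only builds buckets for one slice. A minimal sketch (endpoint and field names are again illustrative placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

NUM_PARTITIONS = 20  # more partitions -> fewer buckets per request
buckets = []
for partition in range(NUM_PARTITIONS):
    resp = es.search(
        index="filebeat-zeek-*",
        body={
            "size": 0,
            "aggs": {
                "by_source": {
                    "terms": {
                        "field": "source.ip",
                        "size": 10000,
                        # Only terms hashing into this partition are bucketed:
                        "include": {
                            "partition": partition,
                            "num_partitions": NUM_PARTITIONS,
                        },
                    },
                    "aggs": {
                        "total_duration": {"sum": {"field": "event.duration"}}
                    },
                }
            },
        },
    )
    buckets.extend(resp["aggregations"]["by_source"]["buckets"])
```

If I understand it correctly, the trade-off is N sequential requests plus a client-side merge instead of one request that has to hold every bucket in memory at once.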