Huge aggregation triggers CircuitBreaker in loop


I was trying to run aggregations on a very large index. The search triggered the circuit breaker, which was expected given the size of the index (billions of documents):

elastic1_1          | "Caused by: org.elasticsearch.transport.RemoteTransportException: [data03][][indices:data/write/bulk]",
elastic1_1          | "Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk] would be [31664846682/29.4gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31664842872/29.4gb], new bytes reserved: [3810/3.7kb], usages [request=0/0b, fielddata=27198847678/25.3gb, in_flight_requests=3810/3.7kb, model_inference=0/0b, eql_sequence=0/0b, accounting=109554198/104.4mb]",
elastic1_1          | "at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.transport.InboundAggregator.checkBreaker( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.transport.InboundAggregator.finishAggregation( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.transport.InboundPipeline.forwardFragments( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.transport.InboundPipeline.doHandleBytes( ~[elasticsearch-7.16.2.jar:7.16.2]",
elastic1_1          | "at org.elasticsearch.transport.InboundPipeline.handleBytes( ~[elasticsearch-7.16.2.jar:7.16.2]",

The problem is that the error is triggered in a loop, which then causes further errors (incoming documents cannot be indexed, nodes disappear, etc.).

It looks like the cluster keeps retrying the aggregation request in a loop and never cancels it, even after the CircuitBreaker was hit. Does that make sense?



I am still having the same issue. Is there a way to automatically cancel a search that triggers the CircuitBreaker? It seems the nodes in my cluster keep trying indefinitely to compute the aggregations, thus triggering the CircuitBreaker in a loop.
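For reference, one manual route to cancelling a search is the task management API (GET _tasks to list tasks, POST _tasks/<task_id>/_cancel to cancel one). Below is a minimal sketch, not an automatic mechanism, that picks long-running cancellable search tasks out of a GET _tasks response and builds the cancel paths. The helper names, the 60-second threshold, and the sample response are hypothetical and only there for illustration.

```python
from typing import Any, Dict, List

NANOS_PER_SECOND = 1_000_000_000


def long_running_search_tasks(tasks_response: Dict[str, Any], max_seconds: float) -> List[str]:
    """Return ids of cancellable search tasks that ran longer than max_seconds."""
    limit_nanos = int(max_seconds * NANOS_PER_SECOND)
    task_ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            is_search = task.get("action", "").startswith("indices:data/read/search")
            is_cancellable = task.get("cancellable", False)
            is_slow = task.get("running_time_in_nanos", 0) > limit_nanos
            if is_search and is_cancellable and is_slow:
                task_ids.append(task_id)
    return sorted(task_ids)


def cancel_path(task_id: str) -> str:
    """Path to POST to in order to cancel one task."""
    return f"/_tasks/{task_id}/_cancel"


# Hypothetical, trimmed-down GET _tasks response used only to exercise the helpers.
SAMPLE = {
    "nodes": {
        "n1": {
            "tasks": {
                "n1:10": {"action": "indices:data/read/search",
                          "cancellable": True,
                          "running_time_in_nanos": 120 * NANOS_PER_SECOND},
                "n1:11": {"action": "indices:data/write/bulk",
                          "cancellable": False,
                          "running_time_in_nanos": 300 * NANOS_PER_SECOND},
            }
        },
        "n2": {
            "tasks": {
                "n2:5": {"action": "indices:data/read/search",
                         "cancellable": True,
                         "running_time_in_nanos": 2 * NANOS_PER_SECOND},
            }
        },
    }
}

paths = [cancel_path(t) for t in long_running_search_tasks(SAMPLE, max_seconds=60)]
```

Cancelling the task does not, as far as I can tell, release fielddata that has already been loaded, so this only addresses the "search keeps running" part of the question.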

The circuit breaker exception above is about bulk indexing, not about querying (it might be a consequence of your initial issue). Are you sending huge or many bulk requests as well?

Also, knowing the Elasticsearch version you are using would help to get a first idea.


You are totally right, I overlooked some details in the error message because I was focusing on the "InboundAggregator.finishAggregation" line.

Here are some more details:

  • Elasticsearch version: 7.16.2 (elasticsearch-7.16.2.jar)
  • How to trigger the problem: launch an aggregation on a field with very high cardinality
  • What happens: the fielddata cache grows to the point where the circuit breaker trips because there is no more heap available

I assumed that it was the aggregation query that triggered the CircuitBreaker, but you are right: it might be other operations, such as a small bulk indexing request, that trip it because there is no more heap available.
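To make that reading of the error concrete, here is a small sketch that parses the CircuitBreakingException message from the log above and reports which action tripped the parent breaker and which child breaker holds most of the memory. The function name and the regular expressions are my own and assume the message layout shown in that log line; other versions may format it differently.

```python
import re
from typing import Any, Dict

# The [parent] breaker message copied from the log excerpt in this thread.
MESSAGE = (
    "[parent] Data too large, data for [indices:data/write/bulk] would be "
    "[31664846682/29.4gb], which is larger than the limit of [31621696716/29.4gb], "
    "real usage: [31664842872/29.4gb], new bytes reserved: [3810/3.7kb], "
    "usages [request=0/0b, fielddata=27198847678/25.3gb, in_flight_requests=3810/3.7kb, "
    "model_inference=0/0b, eql_sequence=0/0b, accounting=109554198/104.4mb]"
)


def parse_breaker_message(message: str) -> Dict[str, Any]:
    """Extract the triggering action, the limit, and per-child usages in bytes."""
    action = re.search(r"data for \[([^\]]+)\]", message)
    limit = re.search(r"limit of \[(\d+)/", message)
    reserved = re.search(r"new bytes reserved: \[(\d+)/", message)
    usages = re.search(r"usages \[(.*)\]", message)
    children = {}
    if usages:
        # Each entry looks like name=<bytes>/<human readable size>.
        for name, size in re.findall(r"(\w+)=(\d+)/", usages.group(1)):
            children[name] = int(size)
    return {
        "action": action.group(1) if action else None,
        "limit": int(limit.group(1)) if limit else None,
        "reserved": int(reserved.group(1)) if reserved else None,
        "children": children,
    }


info = parse_breaker_message(MESSAGE)
largest = max(info["children"], key=info["children"].get)
share_of_limit = info["children"][largest] / info["limit"]
print(f"tripped by {info['action']} reserving {info['reserved']} bytes; "
      f"largest child breaker: {largest} ({share_of_limit:.0%} of the parent limit)")
```

On that message, the request that tripped the breaker is a bulk write reserving only a few kilobytes, while fielddata accounts for most of the parent limit, which matches the explanation above.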

Looking at the stats, I can clearly see that the fielddata cache uses all of the heap memory because of the aggregations.

The problem is that the fielddata cache is never flushed, even though the aggregation was doomed to fail for lack of heap memory. This has the side effect of killing our cluster (nearly every other action triggers the CircuitBreaker in a loop).

Also, as a side note, I managed to reproduce this while running aggregations on fields that have nearly hundreds of millions of distinct values, but also on flattened fields. The strange part with flattened fields is that I am asking for an aggregation on one specific key, and it seems Elasticsearch is using heap to compute the aggregation on document.* in every index, even though that key is only present in a few indices (I might be wrong here, debugging is quite tedious).

Solution I found: don't ask for aggregations on very high cardinality fields.

What I expected:

  • Elasticsearch would first filter the documents, then compute the aggregations for flattened types. It seems to do the reverse: compute all possible aggregations on flattened types, then filter. I am not sure about this, but looking at the fielddata cache per index, that's what seems to happen
  • some sort of automatic fielddata flushing instead of hitting the CircuitBreaker in a loop for every other action
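On the second point, fielddata can at least be flushed by hand with the clear cache API (POST /_cache/clear?fielddata=true, optionally scoped to indices and fields). A minimal sketch of building and sending that request follows; the helper names, the index and field names in the usage comment, and the base URL are hypothetical, and authentication/TLS are left out.

```python
import json
import urllib.request
from typing import Any, Dict, Sequence
from urllib.parse import urlencode


def clear_fielddata_path(indices: Sequence[str] = (), fields: Sequence[str] = ()) -> str:
    """Build the clear cache path, restricted to the fielddata cache."""
    prefix = "/" + ",".join(indices) if indices else ""
    params = {"fielddata": "true"}
    if fields:
        # Limit the clearing to specific fields (comma-separated list).
        params["fields"] = ",".join(fields)
    return f"{prefix}/_cache/clear?{urlencode(params, safe=',')}"


def clear_fielddata(base_url: str, indices: Sequence[str] = (),
                    fields: Sequence[str] = ()) -> Dict[str, Any]:
    """POST the clear cache request and return the decoded JSON response.

    base_url is an assumption, e.g. "http://localhost:9200" for a local,
    unsecured test node.
    """
    request = urllib.request.Request(
        base_url + clear_fielddata_path(indices, fields), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (not executed here, needs a reachable cluster):
#   clear_fielddata("http://localhost:9200", indices=["my-index"], fields=["my_field"])
```

This is a manual workaround rather than the automatic behaviour asked for above. As far as I know, the static node setting indices.fielddata.cache.size (unbounded by default) can also be set so that the cache evicts entries once it reaches that size, but I have not verified how it behaves with the aggregations described here.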


I haven't found a solution to our problem yet. I see many posts about CircuitBreaker errors, but few (if any) answers.
