I have a cluster of 40 nodes. Today Elasticsearch started returning circuit breaker exceptions for all queries on two of the nodes.
Even for this:
$ curl -s 'http://localhost:9200/'
{
"error":{
"root_cause":[
{
"type":"circuit_breaking_exception",
"reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
"bytes_wanted":13610582016,
"bytes_limit":11885484441
}
],
"type":"circuit_breaking_exception",
"reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
"bytes_wanted":13610582016,
"bytes_limit":11885484441
},
"status":503
}
How can I figure out what the problem is? Should the error above say how much memory that particular query needs?
And if that is what it's showing, what's going on? How can the request above (and every other query to those nodes) die with this error?
I can see that the request breakers trip on those nodes. After a cluster restart the same thing starts happening again within a few minutes.
Once we reach this situation, the node remains in the cluster, but every operation sent to it fails.
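For reference, this is how I'm checking the per-node breaker state (the standard node stats API; filter_path just trims the output, and the breakers excerpt quoted further down comes from the same endpoint):
$ curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'
$ curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.request&pretty'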
I've sent it in a private message.
BTW, yesterday we started using some new indices with bigger docs (updated with scripts in bulk, with the source returned in the bulk call); other than that, I'm not aware of anything that has changed.
But I've already disabled that code and the problem still persists.
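For illustration only, the now-disabled bulk call looked roughly like this; the index, type, field and script below are made up, not our real ones, and the script syntax assumes 5.6+/6.x (older 5.x uses "inline" instead of "source"). The "_source": true option is what returned the updated source in the bulk response:
$ cat bulk.ndjson
{ "update": { "_index": "mail", "_type": "doc", "_id": "1" } }
{ "script": { "lang": "painless", "source": "ctx._source.folders = params.folders", "params": { "folders": ["inbox"] } }, "_source": true }
$ curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson
(The ndjson file has to end with a newline.)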
Bigger here means an object mapped with enabled: false and a few (1-6) keys, in the following structure:
According to the logs, the first entries are (so far) from a repeated aggregation, which may or may not be the culprit; I don't know yet:
[2017-11-24T21:52:55,425][WARN ][o.e.i.b.request ] [request] New used memory 10664646024 [9.9gb] for data of [<agg [messagesByFolders]>] would be larger than configured breaker: 10213706956 [9.5gb], breaking
[2017-11-24T21:52:56,049][WARN ][o.e.i.b.request ] [request] New used memory 11342156800 [10.5gb] for data of [<agg [messagesByFolders]>] would be larger than configured breaker: 10213706956 [9.5gb], breaking
[2017-11-24T21:52:56,666][WARN ][o.e.i.b.request ] [request] New used memory 11342156800 [10.5gb] for data of [<agg [messagesByFolders]>] would be larger than configured breaker: 10213706956 [9.5gb], breaking
[2017-11-24T21:52:57,824][WARN ][o.e.i.b.request ] [request] New used memory 11342156800 [10.5gb] for data of [<agg [messagesByFolders]>] would be larger than configured breaker: 10213706956 [9.5gb], breaking
But what I don't understand is why it affects even a simple root request (http://localhost:9200/). Shouldn't the breakers stop only the query that is over the allowed size? Why does the estimated size grow linearly and deny all subsequent requests on the affected node(s) after a few minutes?
Do I misunderstand the concept of breakers entirely?
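If I understand the docs correctly, the [parent] breaker in the error caps the sum of all the child breakers, so once its estimate crosses the limit every request on that node trips, even a plain GET /. As a stopgap while investigating, the request breaker limit can be raised dynamically; a sketch, with an arbitrary 70% value (the default is 60% of the JVM heap):
$ curl -s -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "indices.breaker.request.limit": "70%"
  }
}'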
Why does the request breaker's estimated size constantly grow until it hits the limit and no query can get in? It's not a single query but a bunch of queries (hence the steep curve on the graph, but it's not a Dirac delta).
"breakers" : {
"request" : {
"limit_size_in_bytes" : 10187558092,
"limit_size" : "9.4gb",
"estimated_size_in_bytes" : 11343200256,
"estimated_size" : "10.5gb",
"overhead" : 1.0,
"tripped" : 250
},
How could I see which queries account for these values? (See the sketch after these questions.)
How can it be that the JVM heap usage (as reported by Elasticsearch) doesn't show this increased memory use? (Maybe a counter leak, i.e. the accounting doesn't get decremented after a query finishes?)
Why is this localized to only two nodes? (I will try to narrow it down to specific indices, but it seems to happen only on the nodes that hold the primary and replica shards of a given index.)
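One thing I'm trying, to at least see which requests are in flight on the affected nodes while the estimate climbs, is the task management API plus hot threads (NODE_NAME below is a placeholder):
$ curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*search*&pretty'
$ curl -s 'http://localhost:9200/_nodes/NODE_NAME/hot_threads'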
Also, I think the logged/returned error (currently: [2017-11-24T21:52:55,425][WARN ][o.e.i.b.request ] [request] New used memory 10664646024 [9.9gb] for data of [<agg [messagesByFolders]>] would be larger than configured breaker: 10213706956 [9.5gb], breaking) should also include the individual query's estimated memory requirement, so that if the estimated size grows by gigabytes every second, it can be traced more easily from the logs.
Yep, that's definitely it. It's an issue where the stream is not closed, so the breaker isn't decrementing the accounting by the number of bytes. There's a PR open now to fix this.
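For anyone hitting this: one way to confirm an accounting leak like that is to poll the request breaker's estimate on the affected node while it is otherwise idle; with a leak, estimated_size never drops back down (a sketch, polling every 10 seconds):
$ while sleep 10; do curl -s 'http://localhost:9200/_nodes/_local/stats/breaker?filter_path=nodes.*.breakers.request.estimated_size,nodes.*.breakers.request.tripped'; echo; done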