Circuit breaker exception resulted in termination of all Elasticsearch nodes

Hi All,

I am new to ELK and tried load testing for the first time. I ran some heavy load testing against ELK and tried to create a report for the last three months. When I ran it, shards started failing and all the Elasticsearch nodes and Logstash services moved to a stopped state. In the Logstash logs I got the error below:

[2020-07-24T12:31:50,219][INFO ][logstash.outputs.elasticsearch][nir-esim-gdsp_pipeline][c042dc0baedb208c3ba6bede824f0d8ed0fa8c3a85c5914726e4dfce3f7315bb] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>1}
[2020-07-24T12:31:56,732][INFO ][logstash.outputs.elasticsearch][first_pipeline][c042dc0baedb208c3ba6bede824f0d8ed0fa8c3a85c5914726e4dfce3f7315bb] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [<transport_request>] would be [1050524440/1001.8mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1050521096/1001.8mb], new bytes reserved: [3344/3.2kb], usages [request=24208/23.6kb, fielddata=38134/37.2kb, in_flight_requests=67096/65.5kb, accounting=10372352/9.8mb]", "bytes_wanted"=>1050524440, "bytes_limit"=>1020054732, "durability"=>"PERMANENT"})

Is there a setting so that, instead of Elasticsearch going down, it just rejects the request or returns a timeout error in Kibana? If the ELK cluster goes down, all the logs pile up in the source system.

Hey @rohitarorait82!

Unfortunately, there's no way to swallow these errors and keep ES up and running while it's overloaded.

I'd recommend reading this blog post about the issue and trying to tune your ES cluster to be a better fit for the data you have.
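For reference, here's a rough sketch of where I'd start looking while tuning (the 80% value below is only an illustration, not a recommendation):

# See how close each breaker ("parent", "request", "fielddata", ...) is to its limit
GET _nodes/stats/breaker

# The limit in your error (972.7mb) is the parent breaker, controlled by
# indices.breaker.total.limit. Lowering it makes ES reject requests earlier;
# raising it risks real out-of-memory errors. Giving the nodes more JVM heap
# (jvm.options) is usually the better fix.
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "80%"
  }
}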

Thanks @myasonik for your reply.

I am just trying to run the query below in ELK against a very large data set. Is there a way to find out the maximum time range I can use in this query?

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "API.keyword",
        "order": {
          "1": "desc"
        },
        "size": 500
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "correl.keyword"
          }
        },
        "3": {
          "terms": {
            "field": "Consumer.keyword",
            "order": {
              "1": "desc"
            },
            "size": 50
          },
          "aggs": {
            "1": {
              "cardinality": {
                "field": "correl.keyword"
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "2020-07-30T05:49:39.444Z",
              "lte": "2020-07-30T05:50:39.444Z",
              "format": "strict_date_optional_time"
            }
          }
        }
      ]
    }
  }
}

Hey @rohitarorait82! Sorry about the delayed response. Just talked with our Elasticsearch team... Ordering terms aggs by cardinality is just a really expensive query, so you're prone to run into issues like this.

Some other things I learned:

  • The circuit breaker exception should be non-fatal, so if your nodes are really going down there might be something else going on (although it can be fatal in some cases)
  • The circuit breaker trips whenever overall memory use goes over a certain threshold, so something else may be chewing through your available memory as well, not just this query (though this is an expensive query)
  • A composite agg should be more efficient if you're trying to page through a lot of results, but that might affect your Logstash setup; there's a rough sketch after this list
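Very roughly, a composite version of your aggregation might look something like this (untested sketch reusing the field names from your query; note that a composite agg pages through the API/Consumer combinations by key, so the ordering by cardinality is lost):

# Same range filter as before, but paginated composite buckets instead of nested terms
GET /my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "2020-07-30T05:49:39.444Z",
              "lte": "2020-07-30T05:50:39.444Z",
              "format": "strict_date_optional_time"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "api_consumer": {
      "composite": {
        "size": 500,
        "sources": [
          { "api": { "terms": { "field": "API.keyword" } } },
          { "consumer": { "terms": { "field": "Consumer.keyword" } } }
        ]
      },
      "aggs": {
        "unique_correl": {
          "cardinality": { "field": "correl.keyword" }
        }
      }
    }
  }
}

Each page of 500 key combinations comes back with an after_key, which you pass back as "after" in the next request to keep paging instead of computing everything in one shot.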

@myasonik Thanks a lot, I will check and try to implement all these suggestions.
