Circuit breaker not effective

Hi community!

We have a situation on our Elasticsearch cluster where a single request can quickly bring down all cluster nodes (with an OOM exception).

We updated the cluster configuration to put the following circuit breakers in place:

"indices":{"breaker":{"fielddata":{"limit":"50%"},"request":{"limit":"20%"},"total":{"limit":"30%"}}}}
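For readers who want to reproduce this, here is a minimal Python sketch of how limits like these could be sent to the cluster update settings API. The host address and the helper names (`build_breaker_settings`, `apply_settings`) are illustrative assumptions, not part of the original setup, and whether to use `persistent` or `transient` settings is a choice the thread does not specify.

```python
import json
import urllib.request


def build_breaker_settings(fielddata="50%", request="20%", total="30%"):
    """Build a body for PUT /_cluster/settings with the three breaker limits."""
    return {
        "persistent": {  # assumption: "transient" would work the same way
            "indices.breaker.fielddata.limit": fielddata,
            "indices.breaker.request.limit": request,
            "indices.breaker.total.limit": total,
        }
    }


def apply_settings(body, host="http://localhost:9200"):
    """Send the settings to the cluster; `host` is an assumed address."""
    req = urllib.request.Request(
        host + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the request body; call apply_settings(...) to send it.
    print(json.dumps(build_breaker_settings(), indent=2))
```

Note that the `total` breaker caps the combined usage of the other breakers, so with `total` at 30% the 50% fielddata limit can never actually be reached.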

We also tried setting a size limit for the fielddata cache (40%), but we still get the OOM exception on this request.

Some details about our cluster topology:

  • 5 data nodes
  • max heap : 8gb
  • 286 indices
  • 3072 shards
  • 2,857,660,643 docs
  • 4.07TB

Elasticsearch node log with exception:

Detailed query:

We would like to know how to make sure this kind of request cannot crash our entire cluster, and how to go further in the root cause analysis.

Many thanks for your help! :)

Which version of Elasticsearch are you running? If you are running a version prior to 5.4.2, you may be running into these issues: #25010 and #24941. There are also still some known issues around aggregations and OOM, which we are tracking in #26012.

"fielddata":{"limit":"50%"}

Fielddata is best avoided if you can use doc values instead. See the Elastic Blog post "Support in the Wild: My Biggest Elasticsearch Problem at Scale".
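As a rough illustration of what that can look like on 5.x, the Python sketch below builds a mapping in which an analyzed `text` field gets a `keyword` sub-field. `keyword` fields use doc values by default, so aggregating on the sub-field avoids loading fielddata onto the heap. The type name `doc`, the field name `status`, the sub-field name `raw`, and both helper functions are hypothetical, since your actual query and mappings are not shown here.

```python
import json


def build_mapping(type_name="doc", field="status"):
    """Mapping with a text field plus a keyword sub-field backed by doc values."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field: {
                        "type": "text",  # full-text search; fielddata stays off
                        "fields": {"raw": {"type": "keyword"}},  # doc values
                    }
                }
            }
        }
    }


def build_terms_agg(field="status"):
    """Terms aggregation that targets the keyword sub-field, not the text field."""
    return {
        "size": 0,
        "aggs": {"by_value": {"terms": {"field": field + ".raw"}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_terms_agg(), indent=2))
```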

An average of >10 shards per index and 5 data nodes? Having more shards than data nodes is useful if you plan on expanding out into more data nodes in future but otherwise it's a less efficient way to store the data.
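To put numbers on that, here is a quick back-of-the-envelope calculation in Python using the figures from the original post. It assumes the 3072 shards are spread evenly across the 5 data nodes, and it does not know whether that count includes replicas, so treat the results as rough.

```python
def shard_stats(shards, indices, data_nodes, heap_gb):
    """Return (shards per index, shards per node, heap MB per shard)."""
    per_index = shards / indices
    per_node = shards / data_nodes  # assumes an even distribution
    heap_mb_per_shard = heap_gb * 1024 / per_node
    return per_index, per_node, heap_mb_per_shard


if __name__ == "__main__":
    per_index, per_node, heap_mb = shard_stats(3072, 286, 5, 8)
    print("shards per index: %.1f" % per_index)    # 10.7
    print("shards per node:  %.1f" % per_node)     # 614.4
    print("heap per shard:   %.1f MB" % heap_mb)   # 13.3 MB
```

Roughly 614 shards on a node with an 8 GB heap leaves about 13 MB of heap per shard before any query runs.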

We are running Elasticsearch 5.3.1.
We plan to upgrade to 5.4.2 soon and will see whether the issues persist.

Thanks!
