Circuit breaker not effective

fphilippon · August 29, 2017, 12:17pm

Hi community!

We have a situation on our Elasticsearch cluster where a single request can quickly bring down all cluster nodes (with an OOM exception).

We updated the cluster configuration to put the following circuit breakers in place:

"indices":{"breaker":{"fielddata":{"limit":"50%"},"request":{"limit":"20%"},"total":{"limit":"30%"}}}}

We also tried setting a size limit for the fielddata cache (40%), but this request still triggers the OOM exception.
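To make the configured limits concrete, here is a minimal sketch that converts the percentages from the post into bytes. It assumes the heap is exactly 8 GiB (the logs report a 7.9gb maximum), and the helper name `breaker_limits_bytes` is made up for illustration. It also flags any child breaker whose limit sits above the parent (`total`) limit, since the parent breaker caps the combined estimate of all child breakers.

```python
# Sketch: translate the percentage limits from the post into bytes.
GIB = 1024 ** 3

def breaker_limits_bytes(heap_bytes, limits_pct):
    """Return each breaker's limit in bytes, given percentages of the heap."""
    return {name: heap_bytes * pct // 100 for name, pct in limits_pct.items()}

# Percentages come from the post; the 8 GiB heap is an assumption.
limits = breaker_limits_bytes(8 * GIB, {"fielddata": 50, "request": 20, "total": 30})

for name, size in limits.items():
    print(f"{name:<9} {size / GIB:.1f} GiB")

# A child limit above the parent limit can never be reached, because the
# parent breaker trips first.
shadowed = [n for n, size in limits.items() if n != "total" and size > limits["total"]]
print("child limits above the parent limit:", shadowed)
```

With these settings the 50% fielddata limit is higher than the 30% total limit, so the parent breaker is the one that would trip first. Circuit breakers only account for memory they estimate, so allocations they do not track can still exhaust the heap.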

Some details about our cluster topology:

5 data nodes
max heap : 8gb
286 indices
3072 shards
2,857,660,643 docs
4.07TB

Elasticsearch node log with the exception:

https://gist.github.com/fphilippon/2e8491d8b82b060d26f5ad891277b955

gistfile1.txt

[2017-08-28T07:43:44,873][WARN ][o.e.m.j.JvmGcMonitorService] [log02] [gc][7388] overhead, spent [8s] collecting in the last [8.5s]
[2017-08-28T07:43:59,520][INFO ][o.e.m.j.JvmGcMonitorService] [log02] [gc][old][7395][16] duration [7.7s], collections [1]/[8.6s], total [7.7s]/[26.3s], memory [7.5gb]->[6.8gb]/[7.9gb], all_pools {[young] [228.8mb]->[8.2mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [7.3gb]->[6.8gb]/[7.6gb]}
[2017-08-28T07:43:59,520][WARN ][o.e.m.j.JvmGcMonitorService] [log02] [gc][7395] overhead, spent [7.7s] collecting in the last [8.6s]
[2017-08-28T07:44:12,051][WARN ][o.e.m.j.JvmGcMonitorService] [log02] [gc][old][7398][17] duration [10.1s], collections [1]/[10.5s], total [10.1s]/[36.4s], memory [7.4gb]->[6.9gb]/[7.9gb], all_pools {[young] [22.6mb]->[2.4mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [7.4gb]->[6.9gb]/[7.6gb]}
[2017-08-28T07:44:12,051][WARN ][o.e.m.j.JvmGcMonitorService] [log02] [gc][7398] overhead, spent [10.1s] collecting in the last [10.5s]
[2017-08-28T07:44:24,406][INFO ][o.e.m.j.JvmGcMonitorService] [log02] [gc][old][7402][18] duration [9.2s], collections [1]/[9.3s], total [9.2s]/[45.6s], memory [7.6gb]->[7.4gb]/[7.9gb], all_pools {[young] [224.4mb]->[3.9mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [7.4gb]->[7.4gb]/[7.6gb]}
[2017-08-28T07:44:24,406][WARN ][o.e.m.j.JvmGcMonitorService] [log02] [gc][7402] overhead, spent [9.2s] collecting in the last [9.3s]

This file has been truncated.
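The "overhead" lines in the log above show the JVM spending most of each window in garbage collection. As a rough sketch, the fraction can be extracted with a regular expression. The function name `gc_overhead` is made up for illustration, and the pattern only handles the `s` and `ms` units.

```python
import re

# Matches the "overhead" lines written by JvmGcMonitorService.
OVERHEAD_RE = re.compile(
    r"overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)

def _seconds(value, unit):
    """Convert a numeric string with an 's' or 'ms' unit to seconds."""
    return float(value) / 1000 if unit == "ms" else float(value)

def gc_overhead(line):
    """Return the fraction of the window spent in GC, or None if no match."""
    m = OVERHEAD_RE.search(line)
    if m is None:
        return None
    spent = _seconds(m["spent"], m["spent_unit"])
    window = _seconds(m["window"], m["window_unit"])
    return spent / window

line = ("[2017-08-28T07:43:44,873][WARN ][o.e.m.j.JvmGcMonitorService] [log02] "
        "[gc][7388] overhead, spent [8s] collecting in the last [8.5s]")
print(f"GC overhead: {gc_overhead(line):.0%}")
```

Every overhead line in the excerpt is above 85%, and the old generation barely shrinks after each collection, which is consistent with a heap that is almost entirely live data.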

Detailed query:

https://gist.github.com/fphilippon/44122d2ae12fa146d6f039f34e32ba6f

gistfile1.txt

[2017-08-28T09:52:05,842][DEBUG][o.e.a.s.TransportSearchAction] [log02] [logstash-2017.08.22][5], node[cp0MHe1gT5WwLUlOcw_XDw], [R], s[STARTED], a[id=qJQgMloPR--aAsVXD0UVhw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[logstash*], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, source={
  "size" : 0,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "range" : {
            "@timestamp" : {
              "from" : "1503912042962",
              "to" : "1503913842962",

This file has been truncated.

We would like to know how to make sure that this kind of request cannot crash our entire cluster, and how to take the root cause analysis further.

Many thanks for your help!
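One possible starting point for the root cause analysis is the per-node breaker statistics returned by `GET _nodes/stats/breaker`. The sketch below summarises such a response. The `sample_stats` values are invented for illustration and are not taken from this cluster, and the helper name `breaker_usage` is hypothetical.

```python
# Hypothetical excerpt of a `GET _nodes/stats/breaker` response; the numbers
# are invented for illustration, not measured on the cluster in this thread.
sample_stats = {
    "nodes": {
        "node-id-1": {
            "name": "log02",
            "breakers": {
                "request": {"limit_size_in_bytes": 1717986918,
                            "estimated_size_in_bytes": 1288490188, "tripped": 0},
                "fielddata": {"limit_size_in_bytes": 4294967296,
                              "estimated_size_in_bytes": 429496729, "tripped": 0},
                "parent": {"limit_size_in_bytes": 2576980377,
                           "estimated_size_in_bytes": 1717986917, "tripped": 3},
            },
        }
    }
}

def breaker_usage(stats):
    """Return (node, breaker, used_fraction, tripped) for every breaker."""
    rows = []
    for node in stats["nodes"].values():
        for name, b in node["breakers"].items():
            limit = b["limit_size_in_bytes"]
            used = b["estimated_size_in_bytes"] / limit if limit else 0.0
            rows.append((node["name"], name, used, b["tripped"]))
    return rows

for node, name, used, tripped in breaker_usage(sample_stats):
    print(f"{node} {name:<9} {used:6.1%} of limit, tripped {tripped} times")
```

If the estimated sizes stay far below the limits while the heap fills up, the memory is probably being allocated somewhere the breakers do not account for.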

colings86 · August 29, 2017, 12:27pm

Which version of Elasticsearch are you running? If you are running a version prior to 5.4.2, you may be running into these issues: #25010 and #24941. There are also still some known issues around aggregations and OOM, which we are tracking in #26012.

Mark_Harwood · August 29, 2017, 12:34pm

"fielddata":{"limit":"50%"}

Fielddata is best avoided if you can use doc values instead. See "Support in the Wild: My Biggest Elasticsearch Problem at Scale" on the Elastic Blog.
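As an illustration of preferring doc values, the sketch below picks a field path to aggregate on from a mapping. The field names `host` and `status` and the helper `aggregation_field` are hypothetical. It relies on the fact that `keyword` fields store doc values by default, whereas `text` fields can only be aggregated by loading fielddata onto the heap.

```python
# Hypothetical field mappings; "host" and "status" are illustrative names,
# not fields from the cluster discussed in this thread.
properties = {
    "host": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "status": {"type": "keyword"},
}

def aggregation_field(name, properties):
    """Pick a field path that aggregates on doc values rather than fielddata."""
    mapping = properties[name]
    if mapping["type"] != "text":
        # Simplification: keyword, numeric and date fields use doc values by default.
        return name
    for sub_name, sub in mapping.get("fields", {}).items():
        if sub["type"] == "keyword":
            return f"{name}.{sub_name}"
    raise ValueError(f"{name!r} is a text field with no keyword sub-field")

print(aggregation_field("host", properties))
print(aggregation_field("status", properties))
```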

An average of >10 shards per index and 5 data nodes? Having more shards than data nodes is useful if you plan on expanding out into more data nodes in future but otherwise it's a less efficient way to store the data.
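The shard density can be worked out from the figures in the original post. This is a back-of-the-envelope sketch. It assumes the 3072 shards and 4.07TB are cluster-wide totals (the post does not say whether replicas are included) and that 1 TB is 1024 GB.

```python
# Figures from the original post.
data_nodes = 5
indices = 286
shards = 3072
total_tb = 4.07

shards_per_index = shards / indices
shards_per_node = shards / data_nodes
gb_per_shard = total_tb * 1024 / shards  # assumes 1 TB = 1024 GB

print(f"shards per index: {shards_per_index:.1f}")
print(f"shards per node:  {shards_per_node:.1f}")
print(f"average shard size: {gb_per_shard:.2f} GB")
```

Under those assumptions each node hosts several hundred shards averaging a little over 1 GB each, which supports the point that the data is spread across more shards than five nodes need.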

fphilippon · August 29, 2017, 12:41pm

We are running Elasticsearch 5.3.1.
We plan to upgrade to 5.4.2 soon and will see whether the issues persist.

Thanks!
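A quick way to check whether a node version predates the release mentioned above is a numeric tuple comparison. This is a minimal sketch; the function names are made up, and it only handles plain `major.minor.patch` strings.

```python
def parse_version(version):
    """Turn '5.3.1' into (5, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def predates(running, fixed_in="5.4.2"):
    """Return True if `running` is older than `fixed_in`."""
    return parse_version(running) < parse_version(fixed_in)

print(predates("5.3.1"))
print(predates("5.4.2"))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "5.10.0" would sort before "5.4.2".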

system · September 26, 2017, 12:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
