How to protect an ES cluster from searches that would kill it?

Chris_Neal · July 13, 2015, 10:58pm

Unfortunately, this really happened to me in prod yesterday. It was ugly.

I have a three-tiered cluster with master, client, and data nodes all separated.

My 4 clients have these settings (among others) for a 30GB heap:

 "indices.fielddata.cache.size": "60%",
 "indices.breaker.total.limit": "75%",
 "indices.breaker.request.limit": "50%",
 "indices.breaker.fielddata.limit": "65%",
 "threadpool.bulk.queue_size": "500",
 "threadpool.bulk.size": "32",
 "threadpool.index.queue_size": "500",
 "threadpool.index.size": "32",
 "threadpool.search.queue_size": "2000",

My 6 data nodes have these settings:

 "indices.fielddata.cache.size": "30%",
 "indices.breaker.total.limit": "70%",
 "indices.breaker.request.limit": "30%",
 "indices.breaker.fielddata.limit": "35%",
 "indices.memory.index_buffer_size": "60%",

I saw TONS (1440 in about 1.5 hrs) of these:

org.elasticsearch.ElasticsearchException: org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [event_detail] would be larger than limit of [11274289152/10.5gb]

And these (31K in about 1.5 hours):

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 2000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@3c34ae72

These had ugly downstream effects that I won't go into. Obviously some instruction on proper search techniques is in order, but what else can be done from a cluster perspective to help keep searches from killing the cluster?

My thoughts:

I think my search queue of 2000 is WAY too high. Maybe 20 instead.
I think my indices.breaker.fielddata.limit is WAY too high. What field needs to return 10GB of data? That should be much lower. Maybe 5%? Which is still 1.5GB for a single field.
Same thing for indices.breaker.request.limit. 50% is 15GB, 30% is 9GB. That sounds outrageously high. Again, maybe 5%?
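
A minimal sketch of the arithmetic behind those percentages, assuming the 30GB heap is exactly 30 GiB (the function name is just for illustration):

```python
GIB = 1024 ** 3

def breaker_limit_bytes(heap_bytes: int, percent: int) -> int:
    """Bytes allowed by a breaker limit given as a whole percent of heap."""
    return heap_bytes * percent // 100

heap = 30 * GIB  # assumption: the 30GB heap is 30 GiB

for percent in (50, 35, 30, 5):
    limit = breaker_limit_bytes(heap, percent)
    print(f"{percent:>2}% of heap = {limit} bytes ({limit / GIB:g}gb)")
```

The 35% line comes out to 11274289152 bytes, which is the 10.5gb limit reported in the FIELDDATA exception above, and 5% comes out to 1.5gb.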

Thanks for the insight.
Chris

warkolm · July 13, 2015, 11:16pm

Yes to all your points. Increasing these is usually only a bandaid fix and you end up pushing the problem to somewhere else.

Also look into doc values.

Chris_Neal · July 13, 2015, 11:52pm

Thanks, Mark. I also have doc_values set up for all my mappings.

So, here's another question. All those settings are updateable via the cluster API, which means they are cluster-wide settings. Right now I have two sets of configs, one for clients and one for data nodes. Does a node keep the "local" settings from its elasticsearch.yml file if they differ from what the other node types have? I'm getting hung up on the scope of a setting that is called "cluster-wide" but can also be specified in a node-local yml file.

I'd like to have different settings on my clients than on my data nodes, if that is possible.

Chris

warkolm · July 14, 2015, 1:28am

I think it goes node > cluster, but you should be able to tell using the _nodes API.
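
For the breakers specifically, a rough sketch of comparing what each node actually ended up with. It assumes the node stats API (GET /_nodes/stats/breaker) returns a per-node "breakers" object with a limit_size_in_bytes field for each breaker; the sample response below is made up for illustration, not taken from this cluster, and the helper name is hypothetical:

```python
import json

def breaker_limits(stats: dict) -> dict:
    """Map node name -> {breaker name: limit in bytes} from a node stats response."""
    result = {}
    for node in stats.get("nodes", {}).values():
        breakers = node.get("breakers", {})
        result[node.get("name", "?")] = {
            name: info["limit_size_in_bytes"] for name, info in breakers.items()
        }
    return result

# Hypothetical, trimmed response body; in practice this would be the JSON
# returned by GET /_nodes/stats/breaker on any node in the cluster.
sample = json.loads("""
{
  "nodes": {
    "node-id-1": {
      "name": "client-1",
      "breakers": {
        "fielddata": {"limit_size_in_bytes": 11274289152},
        "request": {"limit_size_in_bytes": 9663676416}
      }
    }
  }
}
""")

for node_name, limits in breaker_limits(sample).items():
    print(node_name, limits)
```

Client and data nodes showing identical limits here would mean the node-local values are not the ones in effect.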

Chris_Neal · July 14, 2015, 2:12am

Welp, here's what I did:

Updated the settings for the whole cluster through the cluster API, like so:

PUT /_cluster/settings?master_timeout=3000000
{
    "persistent" : {
        "threadpool.search.queue_size" : 20,
        "indices.breaker.request.limit": "30%",
        "indices.breaker.fielddata.limit": "35%"
    }
}

Saw it take effect in the logs:

[2015-07-14 01:45:10,844][INFO ][indices.breaker          ] [elasticsearch-bdprodes10] Updating settings parent: [PARENT,type=PARENT,limit=22548578304/21gb,overhead=1.0], fielddata: [FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03], request: [REQUEST,type=MEMORY,limit=9663676416/9gb,overhead=1.0]
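
A quick way to turn a log line like that back into percentages of heap, as a sketch only: the regular expression is written against the format of this one line, the heap is assumed to be 30 GiB, and the function name is made up.

```python
import re

# Matches entries such as [FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03]
BREAKER_RE = re.compile(r"\[(\w+),type=\w+,limit=(\d+)/[\d.]+\w+,overhead=[\d.]+\]")

def breaker_percentages(log_line: str, heap_bytes: int) -> dict:
    """Return {breaker name: limit as a percentage of heap} parsed from a log line."""
    return {
        name.lower(): int(limit) * 100 / heap_bytes
        for name, limit in BREAKER_RE.findall(log_line)
    }

line = (
    r"[2015-07-14 01:45:10,844][INFO ][indices.breaker          ] "
    r"[elasticsearch-bdprodes10] Updating settings parent: "
    r"[PARENT,type=PARENT,limit=22548578304/21gb,overhead=1.0], fielddata: "
    r"[FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03], request: "
    r"[REQUEST,type=MEMORY,limit=9663676416/9gb,overhead=1.0]"
)

heap = 30 * 1024 ** 3  # assumption: 30 GiB heap
print(breaker_percentages(line, heap))
```

For this line it gives 70% for the parent breaker, 35% for fielddata, and 30% for request.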

Then updated the elasticsearch.yml file on just the client nodes as such:

indices:
  breaker:
    fielddata:
      limit: 5%
    request:
      limit: 5%
    total:
      limit: 75%
  fielddata:
    cache:
      size: 60%

Then I cycled only the client nodes, expecting them to pick up these new settings, but on startup they took the same values as the data nodes/cluster settings:

[2015-07-13 21:47:29,875][INFO ][indices.breaker          ] [elasticsearch-bdprodes01] Updating settings parent: [PARENT,type=PARENT,limit=24159191040/22.5gb,overhead=1.0], fielddata: [FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03], request: [REQUEST,type=MEMORY,limit=9663676416/9gb,overhead=1.0]

The _nodes API confirmed the same. I thought this would do it, but it doesn't look like it. Perhaps there is a different way to accomplish this?
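
A toy model of the behavior observed here, assuming that a value set through the cluster settings API takes priority over the same key in elasticsearch.yml, and that the yml value is used only when the cluster has none. This is not Elasticsearch code; the function name and dictionaries are made up for illustration:

```python
def effective_setting(key, cluster_persistent, node_yml, default=None):
    """Return the value a node would use for `key` under the assumed precedence."""
    if key in cluster_persistent:   # set via PUT /_cluster/settings
        return cluster_persistent[key]
    if key in node_yml:             # set in the node's elasticsearch.yml
        return node_yml[key]
    return default

# Values from the PUT request above.
cluster_persistent = {
    "threadpool.search.queue_size": 20,
    "indices.breaker.request.limit": "30%",
    "indices.breaker.fielddata.limit": "35%",
}

# Values from the client nodes' elasticsearch.yml above.
client_yml = {
    "indices.breaker.fielddata.limit": "5%",
    "indices.breaker.request.limit": "5%",
    "indices.breaker.total.limit": "75%",
    "indices.fielddata.cache.size": "60%",
}

for key in ("indices.breaker.fielddata.limit", "indices.breaker.total.limit"):
    print(key, "->", effective_setting(key, cluster_persistent, client_yml))
```

That lines up with the client log line: the parent limit is 22.5gb (the 75% from the yml), while fielddata and request are 10.5gb and 9gb (the 35% and 30% from the cluster settings).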

Chris

Chris_Neal · July 14, 2015, 8:40pm

Still trying to get this to work right. No luck yet getting nodes to have independent configs from the cluster defined ones.

Is it possibly documented somewhere? I'm not finding it, but I could be missing it.

Many thanks!
Chris
