How to protect an ES cluster from searches that would kill it?

Chris_Neal · July 13, 2015, 10:58pm

Unfortunately, this really happened to me in prod yesterday. It was ugly.

I have a three-tiered cluster with master, client, and data nodes all separated.

My 4 clients have these settings (among others) for a 30GB heap:

 "indices.fielddata.cache.size": "60%",
 "indices.breaker.total.limit": "75%",
 "indices.breaker.request.limit": "50%",
 "indices.breaker.fielddata.limit": "65%",
 "threadpool.bulk.queue_size": "500",
 "threadpool.bulk.size": "32",
 "threadpool.index.queue_size": "500",
 "threadpool.index.size": "32",
 "threadpool.search.queue_size": "2000",

My 6 data nodes have these settings:

 "indices.fielddata.cache.size": "30%",
 "indices.breaker.total.limit": "70%",
 "indices.breaker.request.limit": "30%",
 "indices.breaker.fielddata.limit": "35%",
 "indices.memory.index_buffer_size": "60%",

I saw TONS (1440 in about 1.5 hrs) of these:

org.elasticsearch.ElasticsearchException: org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [event_detail] would be larger than limit of [11274289152/10.5gb]

And these (31K in about 1.5 hours):

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 2000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@3c34ae72

which had ugly downstream effects that I won't go into. Obviously some instruction on proper search techniques is in order, but what else can be done from a cluster perspective to help keep searches from killing the cluster?

My thoughts:

I think my search queue of 2000 is WAY too high. Maybe 20 instead.
I think my indices.breaker.fielddata.limit is WAY too high. What field needs to return 10GB of data? That should be much lower. Maybe 5%? Which is still 1.5GB for a single field.
Same thing for indices.breaker.request.limit. 50% is 15GB, 30% is 9GB. That sounds outrageously high. Again, maybe 5%?

Thanks for the insight.
Chris
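For anyone following the percentages above, here is a minimal sketch of the arithmetic, assuming the heap is exactly 30 GiB (which is what the byte counts in the exception imply) and that each breaker limit is simply that percentage of heap. The function name breaker_limit_bytes is made up for illustration and is not part of Elasticsearch.

```python
GIB = 1024 ** 3  # bytes in one GiB; Elasticsearch logs print this unit as "gb"


def breaker_limit_bytes(heap_bytes, percent):
    """Byte limit for a breaker configured as a whole-number percentage of heap."""
    return heap_bytes * percent // 100


heap = 30 * GIB  # assumption: 30GB heap means 30 GiB

# Percentages discussed in the post: current values and the proposed 5%.
for name, pct in [
    ("fielddata (data nodes, current)", 35),
    ("request (data nodes, current)", 30),
    ("request (clients, current)", 50),
    ("proposed", 5),
]:
    limit = breaker_limit_bytes(heap, pct)
    print(f"{name}: {pct}% -> {limit} bytes ({limit / GIB:.1f}gb)")
```

The 35% case works out to 11274289152 bytes, the same 10.5gb figure reported in the CircuitBreakingException.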

warkolm · July 13, 2015, 11:16pm

Yes to all your points. Increasing these is usually only a bandaid fix and you end up pushing the problem to somewhere else.

Also look into doc values.

Chris_Neal · July 13, 2015, 11:52pm

Thanks Mark. I do have doc_values set up for all my mappings as well.

So, here's another question. All those settings are updateable via the cluster API, which means they are cluster-wide settings. Right now I have two sets of configs, one for clients and one for data nodes. Does a node keep its "local" settings from its elasticsearch.yml file if they differ from what the other node types have? I'm getting hung up on the scope of a setting that is called "cluster-wide" but can also be specified in a node-local yml file.

I'd like to have different settings on my clients than on my data nodes, if that is possible.

Chris

warkolm · July 14, 2015, 1:28am

I think it goes node > cluster, but you should be able to tell using the _nodes API.
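A rough sketch of how that check might be scripted, in case it helps. It assumes the node is reachable at http://localhost:9200 and that GET /_nodes/settings returns a "nodes" object keyed by node id, where each entry has a "name" and a nested "settings" object. The helper names (flatten, setting_by_node, fetch_nodes_settings) are made up for illustration.

```python
import json
from urllib.request import urlopen


def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. indices.breaker.request.limit."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def setting_by_node(nodes_response, setting):
    """Map node name -> value of one setting (None if the node does not report it)."""
    result = {}
    for node in nodes_response.get("nodes", {}).values():
        result[node.get("name")] = flatten(node.get("settings", {})).get(setting)
    return result


def fetch_nodes_settings(base_url="http://localhost:9200"):
    """Fetch per-node settings; base_url is an assumption, adjust for your cluster."""
    with urlopen(f"{base_url}/_nodes/settings") as resp:
        return json.load(resp)
```

Calling setting_by_node(fetch_nodes_settings(), "indices.breaker.fielddata.limit") would then show, per node, what each one reports for that key.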

Chris_Neal · July 14, 2015, 2:12am

Welp, here's what I did:

Updated the settings for the whole cluster through the cluster API:

PUT /_cluster/settings?master_timeout=3000000
{
    "persistent" : {
        "threadpool.search.queue_size" : 20,
        "indices.breaker.request.limit": "30%",
        "indices.breaker.fielddata.limit": "35%"
    }
}

Saw it take effect in the logs:

[2015-07-14 01:45:10,844][INFO ][indices.breaker          ] [elasticsearch-bdprodes10] Updating settings parent: [PARENT,type=PARENT,limit=22548578304/21gb,overhead=1.0], fielddata: [FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03], request: [REQUEST,type=MEMORY,limit=9663676416/9gb,overhead=1.0]

Then updated the elasticsearch.yml file on just the client nodes as such:

indices:
  breaker:
    fielddata:
      limit: 5%
    request:
      limit: 5%
    total:
      limit: 75%
  fielddata:
    cache:
      size: 60%

Then cycled only the client nodes, presumably to take these new settings, but on startup, they took the same as the data nodes/cluster settings:

[2015-07-13 21:47:29,875][INFO ][indices.breaker          ] [elasticsearch-bdprodes01] Updating settings parent: [PARENT,type=PARENT,limit=24159191040/22.5gb,overhead=1.0], fielddata: [FIELDDATA,type=MEMORY,limit=11274289152/10.5gb,overhead=1.03], request: [REQUEST,type=MEMORY,limit=9663676416/9gb,overhead=1.0]

The _nodes API confirmed the same. I thought this would do it, but it doesn't look like it. Perhaps there is a different way to accomplish this?

Chris
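The behaviour in those logs is consistent with cluster-level settings taking priority over elasticsearch.yml. Below is a small sketch of that lookup order, assuming the precedence documented for later Elasticsearch releases (transient, then persistent, then elasticsearch.yml, then defaults). Whether this exact order applies to the version in this thread is an assumption, and effective_setting is a hypothetical helper, not an Elasticsearch API.

```python
def effective_setting(key, transient, persistent, node_yml, defaults):
    """Return the first value found, checking the highest-priority layer first."""
    for layer in (transient, persistent, node_yml, defaults):
        if key in layer:
            return layer[key]
    return None


# Values taken from the posts above: the persistent cluster update and the
# elasticsearch.yml placed on the client nodes.
persistent = {
    "indices.breaker.request.limit": "30%",
    "indices.breaker.fielddata.limit": "35%",
}
client_yml = {
    "indices.breaker.request.limit": "5%",
    "indices.breaker.fielddata.limit": "5%",
    "indices.breaker.total.limit": "75%",
}

for key in sorted(client_yml):
    print(key, "->", effective_setting(key, {}, persistent, client_yml, {}))
```

Under that assumption the client nodes end up with 35% and 30% for the fielddata and request breakers despite the 5% in their yml, while total.limit, which was not part of the cluster update, keeps the yml value of 75%. That matches the 22.5gb parent limit in the client log.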

Chris_Neal · July 14, 2015, 8:40pm

Still trying to get this to work right. No luck yet getting nodes to have independent configs from the cluster defined ones.

Is it possibly documented somewhere? I'm not finding it, but I could be missing it.

Many thanks!
Chris
