Unfortunately, this really happened to me in prod yesterday. It was ugly.
I have a three-tiered cluster with master, client, and data nodes all separated.
My 4 client nodes have these settings (among others) with a 30GB heap:
"indices.fielddata.cache.size": "60%",
"indices.breaker.total.limit": "75%",
"indices.breaker.request.limit": "50%",
"indices.breaker.fielddata.limit": "65%",
"threadpool.bulk.queue_size": "500",
"threadpool.bulk.size": "32",
"threadpool.index.queue_size": "500",
"threadpool.index.size": "32",
"threadpool.search.queue_size": "2000",
My 6 data nodes have these settings:
"indices.fielddata.cache.size": "30%",
"indices.breaker.total.limit": "70%",
"indices.breaker.request.limit": "30%",
"indices.breaker.fielddata.limit": "35%",
"indices.memory.index_buffer_size": "60%",
I saw TONS (1440 in about 1.5 hrs) of these:
org.elasticsearch.ElasticsearchException: org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [event_detail] would be larger than limit of [11274289152/10.5gb]
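(For anyone replying: I can pull the matching per-node breaker numbers, i.e. estimated size, configured limit, and tripped count for each breaker, from node stats. Assuming I have the metric name right, it's just:)

curl 'localhost:9200/_nodes/stats/breaker?pretty'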
And these (31K in about 1.5 hours):
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 2000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@3c34ae72
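(The live queue depth and rejected counts per node and pool should also be visible via the cat thread_pool endpoint, if that helps anyone correlate; no custom columns, just the defaults:)

curl 'localhost:9200/_cat/thread_pool?v'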
which had ugly downstream effects that I won't go into. Obviously some instruction on proper search techniques is in order, but what else can be done from a cluster perspective to help keep searches from killing the cluster?
My thoughts:
- I think my search queue of 2000 is WAY too high. Maybe 20 instead.
- I think my indices.breaker.fielddata.limit is WAY too high. What field needs to return 10GB of data? That should be much lower. Maybe 5%, which is still 1.5GB for a single field.
- Same thing for indices.breaker.request.limit. 50% is 15GB and 30% is 9GB, which sounds outrageously high. Again, maybe 5%? (Sketch of how I'd apply the lower values below.)
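If I understand the docs correctly, the breaker limits (and, I believe, the threadpool settings on this version) are dynamic, so I could trial lower values through the cluster settings API before committing anything to elasticsearch.yml. Rough sketch only, and the numbers are just placeholders for whatever we settle on:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.breaker.fielddata.limit": "5%",
    "indices.breaker.request.limit": "5%",
    "threadpool.search.queue_size": 100
  }
}'

(Transient so it resets on restart; I'd move the final values into persistent settings or the yml once they look right.)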
Thanks for the insight.
Chris