ES getting killed by heavy queries

Hey guys,

Recently we had a couple of incidents where our ES cluster received an influx of heavy queries, which pretty much killed the cluster. CPU utilization reached 100% on all nodes, while heap/RAM usage stayed fine. A number of queries had been running for more than 300 seconds; we killed them manually using the task management API, and the cluster recovered. Obviously, this is not a good way of dealing with the problem.
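
For anyone in the same spot, here is a rough sketch of that kind of manual cleanup. It only filters a `GET _tasks?detailed` style response and builds the `POST _tasks/<task_id>/_cancel` paths; nothing is sent to a cluster. The sample response, node names, and the 300-second cutoff are illustrative assumptions, not output from our cluster.

```python
# Sketch: find long-running search tasks in a _tasks response and build
# the matching cancel endpoints. No request is actually sent.

THRESHOLD_SECONDS = 300  # assumption: cutoff for "too long"


def long_running_search_tasks(tasks_response, threshold_seconds=THRESHOLD_SECONDS):
    """Return (task_id, running_seconds) for cancellable search tasks over the cutoff."""
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            running_seconds = task.get("running_time_in_nanos", 0) / 1e9
            is_search = "search" in task.get("action", "")
            if is_search and task.get("cancellable", False) and running_seconds > threshold_seconds:
                found.append((task_id, running_seconds))
    return found


def cancel_path(task_id):
    """Path of the task cancel endpoint for one task."""
    return "/_tasks/{}/_cancel".format(task_id)


# Hand-written sample shaped like a _tasks response (hypothetical values).
sample = {
    "nodes": {
        "nodeA": {
            "tasks": {
                "nodeA:101": {
                    "action": "indices:data/read/search",
                    "cancellable": True,
                    "running_time_in_nanos": 400_000_000_000,
                },
                "nodeA:102": {
                    "action": "indices:data/read/search",
                    "cancellable": True,
                    "running_time_in_nanos": 2_000_000_000,
                },
                "nodeA:103": {
                    "action": "cluster:monitor/tasks/lists",
                    "cancellable": False,
                    "running_time_in_nanos": 900_000_000_000,
                },
            }
        }
    }
}

for task_id, seconds in long_running_search_tasks(sample):
    print("would POST", cancel_path(task_id), "after", seconds, "seconds")
```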

So the question is: is there any circuit breaker (or anything like that) that would kill long-running heavy queries after a certain amount of time (or when CPU utilization reaches a certain threshold)? We have circuit breakers to prevent OOM, but as far as I can tell from the documentation there is nothing for CPU utilization.

Thanks!

See the search timeout option, which by default is unbounded: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#_parameters_4
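
A minimal sketch of what that looks like from the client side, assuming a plain JSON request body sent to a `_search` endpoint. The helper names are made up for illustration. As far as I know the timeout is best effort: shards return whatever hits they have collected when it expires, and the response carries `timed_out: true`, so it is worth checking that flag.

```python
import json


def search_body_with_timeout(query, timeout="30s"):
    """Build a search request body with a per-request timeout (hypothetical helper)."""
    return {"timeout": timeout, "query": query}


def is_partial(response):
    """True if the search response says the timeout was hit and results may be partial."""
    return bool(response.get("timed_out", False))


body = search_body_with_timeout({"match_all": {}}, timeout="10s")
print(json.dumps(body))  # send this as the body of the search request

# Hand-written response fragment, only to show the flag being checked.
print(is_partial({"took": 10000, "timed_out": True, "hits": {"hits": []}}))
```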

Is there a way to set the timeout on the cluster side (config/API call) rather than on the client side?

Yep: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search.html#global-search-timeout
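
The setting described there is `search.default_search_timeout`, which can be changed through the cluster update settings API; setting it to `-1` resets it to no timeout. Below is a small sketch that only builds the `PUT _cluster/settings` request without sending it. The base URL, the `30s` value, and the choice of `transient` over `persistent` are assumptions for illustration.

```python
import json
import urllib.request


def default_search_timeout_request(base_url, timeout="30s"):
    """Build (but do not send) a PUT _cluster/settings request for the global search timeout."""
    payload = {"transient": {"search.default_search_timeout": timeout}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumption: a node reachable on localhost:9200.
req = default_search_timeout_request("http://localhost:9200", "30s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would send it; left out so the sketch runs offline.
```

A timeout set on an individual search request still takes precedence over this default.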
