Hey guys,
Recently we had a couple of situations where our ES cluster received an influx of heavy queries, and that pretty much killed the cluster. CPU utilization reached 100% on all of the nodes in the cluster, while heap/RAM was doing fine. We had a bunch of queries running for more than 300 seconds; we killed them manually using the task management API and the cluster recovered. Obviously, this is not a sustainable way of dealing with this.
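For reference, the manual cleanup amounts to roughly the sketch below: list the running search tasks, pick the ones that have been running longer than a threshold, and cancel them. This is only an illustration, not our exact script. The `ES_URL` address and the helper names `find_long_running` and `cancel_long_running_searches` are made up for the example, and the 300-second threshold just mirrors what we saw.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: placeholder cluster address
MAX_RUNTIME_SECONDS = 300          # assumption: threshold taken from our incident


def find_long_running(tasks_response, max_seconds):
    """Return IDs of cancellable tasks that have run longer than max_seconds.

    tasks_response is the parsed JSON body of GET _tasks, which groups
    tasks per node under "nodes" -> <node_id> -> "tasks".
    """
    limit_nanos = max_seconds * 1_000_000_000
    long_running = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            too_long = task.get("running_time_in_nanos", 0) > limit_nanos
            if task.get("cancellable") and too_long:
                long_running.append(task_id)
    return long_running


def cancel_long_running_searches(base_url=ES_URL, max_seconds=MAX_RUNTIME_SECONDS):
    """List search tasks and cancel the ones over the threshold (hypothetical helper)."""
    list_url = f"{base_url}/_tasks?actions=*search*&detailed=true"
    with urllib.request.urlopen(list_url) as resp:
        tasks = json.load(resp)

    cancelled = []
    for task_id in find_long_running(tasks, max_seconds):
        # POST _tasks/<task_id>/_cancel asks the node to cancel that task.
        req = urllib.request.Request(
            f"{base_url}/_tasks/{task_id}/_cancel", method="POST"
        )
        urllib.request.urlopen(req).close()
        cancelled.append(task_id)
    return cancelled
```

Running something like this on a schedule would work as a stopgap, but it is still an external watchdog rather than something the cluster enforces on its own.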
So the question is: is there any circuit breaker (or anything like that) that would kill long-running heavy queries after a certain amount of time (or when CPU utilization reaches a certain threshold)? We have circuit breakers to prevent OOM, but as far as I can tell from the documentation there's nothing equivalent for CPU utilization.
Thanks!