We have situations where we see big cpu spikes in our ES cluster. Most of the time, a couple of queries will cause the spike. However since our cluster runs thousands of queries/sec, it's hard to determine which query caused the spike.
We have done some work to help investigate these spikes. For example, we monitor the latency that ES returns (the "took" part of the response) as well the client perceived latency. We also track queries that have a ridiculously high number of hits. However, when a bad query hits the system, it ends up stalling the search queue and causing other queries to take longer than usual. However, that's just because of those queries sitting in the queue waiting for the real culprit queries to finish. So if I take some of the queries that took say more than 5s and I re-run, it takes less than 50ms.
I feel ES needs to provide clients with better estimates of query cost to help diagnosing such issues. I'm not too familiar with the ES code base, but some of this would be very helpful if it can easily be obtained:
- time sitting in the queue vs. time actually running. It's very important to have these two separated
- total time worker threads spent on the query
- number of reads a query generated
- some magic "cost" that ES that compute based on factors that are involved in running the query
I'd also love for ES to include a custom string "queryId" argument in the query so that clients can include identifying information about the queries that will help with tracing a query. If you have a large number of query patterns, it's usually hard to figure out what part of the code generated some query. So having some sort of id is very useful. When I used to work on SQL, we used to add comments in the queries that indicated where these queries came from and that was extremely useful in troubleshooting bad queries.
I'd also love to hear how people usually deal with these problems. Seems to be a common question that comes up often here and on stack overflow. I think it's an area where the elastic team should invest more (similar to what was done with the "profile" option).