Recommendations for tracking down slow queries

mahdouch · September 28, 2016, 12:11am

We have situations where we see big cpu spikes in our ES cluster. Most of the time, a couple of queries will cause the spike. However since our cluster runs thousands of queries/sec, it's hard to determine which query caused the spike.

We have done some work to help investigate these spikes. For example, we monitor the latency that ES returns (the "took" part of the response) as well the client perceived latency. We also track queries that have a ridiculously high number of hits. However, when a bad query hits the system, it ends up stalling the search queue and causing other queries to take longer than usual. However, that's just because of those queries sitting in the queue waiting for the real culprit queries to finish. So if I take some of the queries that took say more than 5s and I re-run, it takes less than 50ms.

I feel ES needs to provide clients with better estimates of query cost to help diagnosing such issues. I'm not too familiar with the ES code base, but some of this would be very helpful if it can easily be obtained:

time sitting in the queue vs. time actually running. It's very important to have these two separated
total time worker threads spent on the query
number of reads a query generated
some magic "cost" that ES that compute based on factors that are involved in running the query
etc..

I'd also love for ES to include a custom string "queryId" argument in the query so that clients can include identifying information about the queries that will help with tracing a query. If you have a large number of query patterns, it's usually hard to figure out what part of the code generated some query. So having some sort of id is very useful. When I used to work on SQL, we used to add comments in the queries that indicated where these queries came from and that was extremely useful in troubleshooting bad queries.

I'd also love to hear how people usually deal with these problems. Seems to be a common question that comes up often here and on stack overflow. I think it's an area where the elastic team should invest more (similar to what was done with the "profile" option).

Luke_Nezda · September 28, 2016, 1:49am

Looks like Named Queries is "queryId" feature you're looking for: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-named-queries-and-filters.html (just noticed it myself!)

mahdouch · September 28, 2016, 4:23pm

Looks like Named Queries is "queryId" feature you're looking for:

Not exactly. That's for determining which part of a bool query caused a document to match.

warkolm · September 29, 2016, 12:44am

There's slow log and the new profiler API.
We haven't yet had time to expand on the latter!

PRs are also welcome too

Topic		Replies	Views
ElasticSearch 1.7: query time spike, and the application processor crash after that Elasticsearch	10	755	December 29, 2017
Slow first request on an index after a short amount of time Elasticsearch	6	9901	March 13, 2020
Tracing query execution? Elasticsearch	7	2669	July 6, 2017
Query Performance Elasticsearch	11	1824	July 6, 2017
How to track down slow Elasticsearch queries? Elasticsearch	6	1233	April 16, 2019

Recommendations for tracking down slow queries

Related topics