How to find the problematic searches that contribute to high load and CPU usage?

Recently our cluster has seen spikes of high load on some nodes. My hunch is that it has to do with some specific searches, but how do you determine which searches are causing it?

I have seen some posts here saying that looking at hot threads would help, but if I look at our own hot threads I am not entirely sure what I should be looking at.

Here’s the result from running

GET /_nodes/hot_threads

I can provide other logs if that’s useful. I’d also like to find out where this is covered in the docs, because I don’t know what other methods I can use to debug this myself.

Check your slow log as well; it should highlight any slow searches.

I did enable slow log but only on one index. I haven’t figured out how to enable slow log for all the indices as we have many. Is there a cluster-level slow log that I can enable?

You should be able to PUT */_settings and then just apply what you have for the one index to all of them.
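As a rough illustration of that suggestion, here is a minimal Python sketch that builds a `PUT */_settings` request carrying the search slow log thresholds. The node address and the threshold values are assumptions for the example, not recommendations; the sketch only prints the request and does not send it unless you uncomment the last line.

```python
import json
import urllib.request

# Hypothetical threshold values; tune them to your own latency expectations.
SLOWLOG_SETTINGS = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "800ms",
}


def build_slowlog_request(base_url, settings, index_pattern="*"):
    """Build (but do not send) a PUT <index_pattern>/_settings request."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Assumed address of a node; replace with your own.
    req = build_slowlog_request("http://localhost:9200", SLOWLOG_SETTINGS)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings, uncomment the next line.
    # urllib.request.urlopen(req)
```

Passing a narrower `index_pattern` (for example a common index prefix followed by `*`) limits the change to a subset of indices.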

Didn’t think of that. Thank you! I will enable it and report back if I see any anomalies.

A related high-level question about doing calculations directly on the aggregation: we are doing percentage calculations directly on the aggregations so that the returned data already contains some usable results. This can theoretically be done in the program after we have gotten the results. Could this be the reason why the search was slow?
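For reference, the client-side version of such a percentage calculation could look roughly like this. It is only a sketch, assuming a terms-style aggregation whose buckets carry `key` and `doc_count`; the aggregation name `by_status` and the counts are made up for the example.

```python
def bucket_percentages(buckets):
    """Return {bucket key: percentage of the summed doc_count}."""
    total = sum(b["doc_count"] for b in buckets)
    if total == 0:
        return {b["key"]: 0.0 for b in buckets}
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Hypothetical fragment of an aggregation response.
response = {
    "aggregations": {
        "by_status": {
            "buckets": [
                {"key": "ok", "doc_count": 750},
                {"key": "error", "doc_count": 250},
            ]
        }
    }
}

shares = bucket_percentages(response["aggregations"]["by_status"]["buckets"])
print(shares)  # {'ok': 75.0, 'error': 25.0}
```

Note that the percentages here are relative to the buckets that were returned, not to every document in the index.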

The slow log can only tell me so much: that a specific query is slow, but not necessarily which part of the query is slow. (Or perhaps it can and I am just not reading it properly.)

It might be worth making another topic about optimising the query. You can take a look at the _explain endpoint to get a better idea of what it's doing though.

Are you talking about this: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-explain.html

The explain api computes a score explanation for a query and a specific document. This can give useful feedback whether a document matches or didn’t match a specific query.

If so, I don’t understand how that helps. It asks me to supply a single document from the original result and then shows how it matches, but we are searching 40 million entries, so a single-document analysis does not help me pinpoint what is wrong.

Perhaps you are talking about something different — if so, let me know, thanks!

Yeah, but you're right that it's not useful here. Not sure what I was thinking there, sorry!

I have now been able to look at the slow logs and identified that the slow queries are against a specific index. However, the slow log does not show the actual query. We have many different kinds of searches:

  • some are simple
  • some involve complex bool clauses
  • some involve complex aggregations

Having just the index name does not help. Is there a way to find the exact search (perhaps get the body of the search)?

It should be showing up in the source field as per https://www.elastic.co/guide/en/elasticsearch/reference/7.10/index-modules-slowlog.html#_identifying_search_slow_log_origin
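If it helps, here is a small Python sketch for pulling the search body out of a plain-text slow log line. The sample line is hypothetical and only modelled on the format shown in the linked docs, so the exact fields may differ between versions. It reads `took_millis` and then parses the JSON that follows `source[`.

```python
import json
import re

# Hypothetical line, modelled on the plain-text search slow log format.
SAMPLE_LINE = (
    '[2021-01-12T10:15:30,123][WARN ][i.s.s.query] [node-1] [my-index][0] '
    'took[5.2s], took_millis[5200], total_hits[42 hits], stats[], '
    'search_type[QUERY_THEN_FETCH], total_shards[5], '
    'source[{"query":{"match":{"title":"foo [bar]"}}}], id[],'
)


def parse_slowlog_line(line):
    """Extract took_millis and the search body from one slow log line."""
    took = re.search(r"took_millis\[(\d+)\]", line)
    start = line.find("source[")
    if took is None or start == -1:
        return None
    raw = line[start + len("source["):]
    try:
        # raw_decode stops at the end of the first JSON value, so brackets
        # inside the query itself do not confuse the parser.
        source, _ = json.JSONDecoder().raw_decode(raw)
    except json.JSONDecodeError:
        # Truncated or otherwise unparsable body: keep the raw text.
        source = raw
    return {"took_millis": int(took.group(1)), "source": source}


if __name__ == "__main__":
    print(parse_slowlog_line(SAMPLE_LINE))
```

Sorting the parsed entries by `took_millis` and grouping them by the shape of `source` is one way to see which kind of search dominates.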

You may also want to lower the default thresholds to try to catch more queries.

Related to this, is it possible to output these slow logs as JSON, send them to Logstash, or view them within the internal Kibana monitoring somehow?

Some of the lines are so long that they get truncated.