Recently our cluster has seen spikes of high load on some nodes. My hunch is that it has to do with some specific searches, but how do you determine which searches are causing it?
I have seen some posts here saying that looking at hot threads would help, but if I look at our own hot threads I am not entirely sure what I should be looking at.
Here’s the result from running
GET /_nodes/hot_threads
I can provide other logs if that would be useful. I would also like to read about this in the docs, because I don’t quite know what other methods I can use to debug this myself.
I did enable slow log but only on one index. I haven’t figured out how to enable slow log for all the indices as we have many. Is there a cluster-level slow log that I can enable?
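As far as I know the search slow log thresholds are index-level settings, but a settings update can target every index at once with `_all` or a wildcard pattern. A minimal sketch of what such a request could look like, built in Python without sending anything; the helper name `slowlog_settings_request` and the threshold values are illustrative assumptions, not recommendations:

```python
import json


def slowlog_settings_request(index_pattern="_all", warn="10s", info="5s"):
    """Build (method, path, body) for a settings update that sets
    search slow log thresholds on every index matching index_pattern.

    The threshold values are placeholders; pick ones that fit your workload.
    """
    body = {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": "1s",
    }
    return "PUT", f"/{index_pattern}/_settings", body


method, path, body = slowlog_settings_request()
print(method, path)
print(json.dumps(body, indent=2))
```

This only touches indices that exist when the request is sent; indices created later would presumably need the same settings applied through an index template.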
Didn’t think of that. Thank you! I will enable it and report back if I see any anomalies.
A related high-level question about doing calculations directly on the aggregation: we are doing percentage calculations directly on the aggregations so that the returned data already contains some usable results. This can theoretically be done in the program after we have gotten the results. Could this be the reason why the search was slow?
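For comparison, here is a rough sketch of doing the percentage calculation in the program after the results come back. It assumes a plain terms-style aggregation whose response contains `buckets` with `key` and `doc_count`; the function name and the sample data are made up for illustration. Whether moving the calculation client-side speeds up the search depends on which part of the query is actually slow.

```python
def bucket_percentages(agg):
    """Return {bucket key: share of the summed doc_count, in percent}.

    Percentages are relative to the buckets that were returned, not to
    every document in the index.
    """
    buckets = agg.get("buckets", [])
    total = sum(b["doc_count"] for b in buckets)
    if total == 0:
        return {}
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Hypothetical aggregation section of a search response.
sample = {
    "buckets": [
        {"key": "error", "doc_count": 25},
        {"key": "ok", "doc_count": 75},
    ]
}
percentages = bucket_percentages(sample)
print(percentages)
```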
The slow log only gives me so much info: it tells me that a specific query is slow, but not necessarily which part of the query is slow. (Or perhaps it does and I am just not reading it properly.)
It might be worth making another topic about optimising the query. You can take a look at the _explain endpoint to get a better idea of what it's doing though.
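As a rough illustration of the `_explain` endpoint mentioned above, this sketch builds the request for one document without sending it. The index name, document id and query are hypothetical placeholders, and `explain_request` is just a helper name chosen for this example.

```python
import json
from urllib.parse import quote


def explain_request(index, doc_id, query):
    """Build (method, path, body) for an explain call on a single document."""
    # The document id is percent-encoded so that characters such as "/"
    # do not change the URL path.
    path = f"/{index}/_explain/{quote(str(doc_id), safe='')}"
    return "GET", path, {"query": query}


method, path, body = explain_request(
    "my-index",  # hypothetical index name
    "42",        # hypothetical document id
    {"match": {"message": "timeout"}},
)
print(method, path)
print(json.dumps(body))
```

Note that this explains how one document scores against the query; it does not report timings.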
The explain api computes a score explanation for a query and a specific document. This can give useful feedback whether a document matches or didn’t match a specific query.
I ask because I don’t understand how that helps. It asks me to supply a single document from the original result and then shows how it matches, but we are searching 40 million entries, so a single-document analysis does not help me pinpoint what is wrong.
Perhaps you are talking about something different — if so, let me know, thanks!
I have now been able to see the slow logs and identified that the slow queries are against a specific index. However, the slow log does not show the actual query. We have many different kinds of searches:
some are simple
some involve complex bool clauses
some involve complex aggregations
Having just the index name does not help. Is there a way to find the exact search (perhaps get the body of the search)?