How to find the problematic searches that contribute to high load and CPU usage?

smlbiobot · November 3, 2020, 2:34am

Recently our cluster has seen spikes of high load on some nodes. My hunch is that it has to do with some specific searches, but how do you determine which searches are causing it?

I have seen some posts here saying that looking at hot threads would help, but if I look at our own hot threads I am not entirely sure what I should be looking at.

Here’s the result from running

GET /_nodes/hot_threads

https://gist.github.com/smlbiobot/3970dd153007ff07609c74e6a4c462f1

hot-threads-20201103.txt

::: {esaux1}{yiEO_THLRpuMyIoVDwWqUw}{xnBnaGMYTJGy6W4HrHOdIQ}{172.104.108.156}{172.104.108.156:9300}{xpack.installed=true}
   Hot threads at 2020-11-03T02:29:08.517, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {linode5}{gkndbKNqQfuW0HMl9FLHlg}{bAOg0oeZSTeg5sfs4z7xxA}{172.104.96.121}{172.104.96.121:9300}{ml.machine_memory=33728798720, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, ml.enabled=true}
   Hot threads at 2020-11-03T02:29:08.527, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   103.4% (517.2ms out of 500ms) cpu usage by thread 'elasticsearch[linode5][search][T#4]'
     2/10 snapshots sharing following 33 elements
       app//org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:266)
       app//org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)

(This file has been truncated here; see the gist above for the full output.)

I can give you other logs if that’s useful. I would also like to know where this is covered in the docs, because I don’t quite know what other methods I can use to debug this myself.
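For readers unsure what to look for in hot threads output like the above, one rough approach is to total the reported CPU percentages per node and thread pool, then read the stack frames of the busiest pool. The following is a minimal sketch under that assumption; the function name `summarize_hot_threads` is made up for illustration, and the sample text is copied from the truncated gist preview in this post.

```python
import re
from collections import defaultdict

# Matches the per-thread header line in hot_threads output, for example:
# 103.4% (517.2ms out of 500ms) cpu usage by thread 'elasticsearch[linode5][search][T#4]'
HEADER = re.compile(
    r"^\s*(?P<pct>[\d.]+)% \([^)]*\) cpu usage by thread "
    r"'elasticsearch\[(?P<node>[^\]]+)\]\[(?P<pool>[^\]]+)\]"
)


def summarize_hot_threads(text):
    """Sum the reported CPU percentages per (node, thread pool) pair."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            totals[(match["node"], match["pool"])] += float(match["pct"])
    return dict(totals)


# Lines taken from the gist preview quoted in this thread.
SAMPLE = """\
   Hot threads at 2020-11-03T02:29:08.527, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   103.4% (517.2ms out of 500ms) cpu usage by thread 'elasticsearch[linode5][search][T#4]'
     2/10 snapshots sharing following 33 elements
       app//org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:266)
       app//org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
"""

if __name__ == "__main__":
    for (node, pool), pct in summarize_hot_threads(SAMPLE).items():
        print(f"{node} {pool}: {pct:.1f}%")
```

In the sample, the busy thread belongs to the `search` pool on `linode5`, and its top frames are in Lucene's `OrdinalMap`, the class that maps per-segment ordinals to global ordinals. That may point toward searches that need global ordinals, though the truncated trace alone does not confirm which query is responsible.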

warkolm · November 4, 2020, 5:43am

Check your slow log as well, it should highlight anything.

smlbiobot · November 4, 2020, 12:06pm

I did enable slow log but only on one index. I haven’t figured out how to enable slow log for all the indices as we have many. Is there a cluster-level slow log that I can enable?

warkolm · November 4, 2020, 9:46pm

You should be able to PUT */_settings and then just apply what you have for the one index to all of them.
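As a sketch of what that request could look like from a script: the snippet below builds a `PUT */_settings` request carrying search slow log thresholds. The host `http://localhost:9200`, the threshold values, and the helper name `build_slowlog_request` are assumptions for illustration; substitute whatever you already applied to the single index.

```python
import json
import urllib.request


def build_slowlog_request(base_url, index_pattern="*", warn="2s", info="1s"):
    """Build (but do not send) a PUT <pattern>/_settings request.

    The threshold values are placeholders, not recommendations.
    """
    settings = {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": warn,
        "index.search.slowlog.threshold.fetch.info": info,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_slowlog_request("http://localhost:9200")  # assumed host
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings, uncomment the following lines:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Note that this only updates indices that exist when the request runs; indices created later would need the same settings applied again or supplied through an index template.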

smlbiobot · November 5, 2020, 1:26am

Didn’t think of that. Thank you! I will enable it and report back if I see any anomalies.

A related high-level question about doing calculations directly on the aggregation: we are doing percentage calculations directly on the aggregations so that the returned data already contains some usable results. This can theoretically be done in the program after we have gotten the results. Could this be the reason why the search was slow?
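One way to test that hunch would be to run the same aggregation without the in-query percentage step, compute the percentages in the program, and compare the reported `took` times. A minimal sketch of the client-side half follows. It assumes the response contains ordinary buckets with `key` and `doc_count` fields (as a terms aggregation returns); the helper name and the sample buckets are hypothetical.

```python
def bucket_percentages(buckets):
    """Return each bucket's share of the total doc_count, in percent.

    The total is taken over the buckets passed in, so if the aggregation
    only returns the top N buckets, the shares are relative to those N.
    """
    total = sum(bucket["doc_count"] for bucket in buckets)
    if total == 0:
        return {bucket["key"]: 0.0 for bucket in buckets}
    return {
        bucket["key"]: 100.0 * bucket["doc_count"] / total
        for bucket in buckets
    }


# Hypothetical buckets, shaped like response["aggregations"][name]["buckets"].
SAMPLE_BUCKETS = [
    {"key": "a", "doc_count": 30},
    {"key": "b", "doc_count": 10},
]

if __name__ == "__main__":
    for key, pct in bucket_percentages(SAMPLE_BUCKETS).items():
        print(f"{key}: {pct:.1f}%")
```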

The slow log could only give me so much info — i.e. that a specific query is slow, but not necessarily which part of the query is slow. (Or perhaps it could but I just am not reading it properly)

warkolm · November 5, 2020, 1:31am

It might be worth making another topic about optimising the query. You can take a look at the _explain endpoint to get a better idea of what it's doing though.

smlbiobot · November 5, 2020, 2:17am

Are you talking about this: the Explain API page in the Elasticsearch Guide [6.8]?

The explain api computes a score explanation for a query and a specific document. This can give useful feedback whether a document matches or didn’t match a specific query.

Because I don’t understand how that is helping. It asks me to supply a single document in the original result and then show how it is matching — but we are searching 40 million entries so the single document analysis does not actually help me pinpoint what is wrong.

Perhaps you are talking about something different — if so, let me know, thanks!

warkolm · November 5, 2020, 2:44am

Yeah, but you're right in that it's not useful here. Not sure what I was thinking there sorry!
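For the question of which part of a query is slow, the Profile API (setting `"profile": true` in the search body) is likely closer to what is needed than `_explain`, since it reports timings per query component and per aggregation on each shard. The sketch below assumes the response layout described in the Profile API documentation (`profile.shards[].searches[].query[]` and `profile.shards[].aggregations[]`, each entry with `type` and `time_in_nanos`); the helper names and the sample response are made up for illustration, and only top-level components are ranked, not their children.

```python
def with_profiling(search_body):
    """Return a shallow copy of a search body with profiling switched on."""
    profiled = dict(search_body)
    profiled["profile"] = True
    return profiled


def slowest_components(response, top=5):
    """Rank top-level query and aggregation components by time spent."""
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for node in search.get("query", []):
                rows.append(
                    (node["time_in_nanos"], "query", node["type"], shard["id"])
                )
        for agg in shard.get("aggregations", []):
            rows.append(
                (agg["time_in_nanos"], "aggregation", agg["type"], shard["id"])
            )
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]


# Hypothetical, heavily trimmed response used only to exercise the function.
SAMPLE_RESPONSE = {
    "profile": {
        "shards": [
            {
                "id": "[node-id][my-index][0]",
                "searches": [
                    {"query": [{"type": "BooleanQuery", "time_in_nanos": 1200000}]}
                ],
                "aggregations": [
                    {
                        "type": "GlobalOrdinalsStringTermsAggregator",
                        "time_in_nanos": 9800000,
                    }
                ],
            }
        ]
    }
}

if __name__ == "__main__":
    print(with_profiling({"size": 0}))
    for nanos, kind, name, shard_id in slowest_components(SAMPLE_RESPONSE):
        print(f"{nanos / 1e6:.1f} ms  {kind:<12} {name}  {shard_id}")
```

Profiling adds overhead of its own, so it is better suited to re-running one suspect query by hand than to leaving it enabled on production traffic.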

smlbiobot · November 19, 2020, 11:30pm

I have now been able to see the slow logs and have identified that the slow queries are against a specific index. However, the slow log does not show the actual query. We have many different kinds of searches:

some are simple
some involve complex bool clauses
some involve complex aggregations

Having just the index name does not help. Is there a way to find the exact search (perhaps get the body of the search)?

warkolm · November 19, 2020, 11:59pm

It should be showing up in the source field as per https://www.elastic.co/guide/en/elasticsearch/reference/7.10/index-modules-slowlog.html#_identifying_search_slow_log_origin

You may also want to lower the default thresholds to try to catch more queries.
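Once the `source[...]` field does appear, pulling the query body out of each line makes it easier to group slow searches by shape. The following is a rough sketch that assumes the plain-text slow log layout with `took_millis[...]` and `source[...]` fields; the exact layout varies between versions, and the sample line is made up, not taken from a real log.

```python
import json
import re

TOOK = re.compile(r"took_millis\[(\d+)\]")


def extract_source(line):
    """Return the text inside source[...], matching nested square brackets.

    Brackets inside JSON string values are not handled specially; this is
    a sketch, not a full parser.
    """
    start = line.find("source[")
    if start == -1:
        return None
    begin = start + len("source[")
    depth = 1
    for pos in range(begin, len(line)):
        char = line[pos]
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
            if depth == 0:
                return line[begin:pos]
    # Unbalanced brackets: the line was probably truncated by the logger.
    return line[begin:]


def parse_slowlog_line(line):
    """Extract the duration in milliseconds and the raw query source."""
    took = TOOK.search(line)
    return {
        "took_millis": int(took.group(1)) if took else None,
        "source": extract_source(line),
    }


# Hypothetical line shaped like a search slow log entry.
SAMPLE_LINE = (
    "[2020-11-19T23:00:00,000][WARN ][index.search.slowlog.query] [linode5] "
    "[my-index][0] took[2.5s], took_millis[2500], search_type[QUERY_THEN_FETCH], "
    'total_shards[5], source[{"query":{"terms":{"tag":["a","b"]}}}], id[],'
)

if __name__ == "__main__":
    parsed = parse_slowlog_line(SAMPLE_LINE)
    print(parsed["took_millis"], json.loads(parsed["source"]))
```

If the source was cut off in the log, `extract_source` returns the partial text rather than failing, so the caller should expect `json.loads` to raise on such entries.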

smlbiobot · November 20, 2020, 3:21pm

Related to this, is it in fact possible to output these slow logs as JSON or Logstash or within the internal Kibana monitoring somehow?

Some of the lines are so long that they get truncated.

system · December 18, 2020, 3:22pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
