I am wondering if there is a solution to a slow-query problem I am having.
- Is it a single slow node? How do I find out?
- Is the coordinating (query) node the bottleneck? It does not look busy, but how can I confirm that?
- Are the shards too small? Would half as many shards (each twice as large) go twice as fast?
- Are there too many nodes? Would half the nodes (each holding twice the shards) reduce the maximum latency?
- Or is the dataset just big?
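To check the single-slow-node theory, I was thinking the profile API might reveal an outlier shard; a sketch of what I might run (trimmed query body, `treeherder*` as a stand-in for my nine indexes):

```
POST treeherder*/_search
{
  "profile": true,
  "size": 0,
  "query": {"bool": {"filter": [
    {"prefix": {"repo.changeset.id.~s~": "b760586ab7e62af195a44bbaa43b01be047c11db"}},
    {"term": {"repo.branch.name.~s~": "autoland"}}
  ]}}
}
```

If I read the docs correctly, the response includes a per-shard timing breakdown, so a consistently slow shard should point at its node.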
This is a 40-node cluster with a number of indexes. The shards for each index are spread evenly over the nodes so that each node picks up some of the query effort. Each shard is targeted at about 20 GB so it can be moved/recovered within a reasonable amount of time.
This particular query hits 9 indexes, each covering 3 months of data, for a total of 180 shards (20 per index).
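To confirm the shards really are spread evenly, I can list them per node (again assuming a `treeherder*` index pattern):

```
GET _cat/shards/treeherder*?v&h=index,shard,prirep,node,store&s=node
```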
Thank you
Here is the slow query
{
"_source":false,
"from":0,
"query":{"bool":{"should":[
{"bool":{"filter":[
{"nested":{
"inner_hits":{
"_source":false,
"size":100000,
"stored_fields":["failure.notes.~N~.text.~s~"]
},
"path":"failure.notes.~N~",
"query":{"match_all":{}}
}},
{"bool":{"filter":[
{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
{"term":{"repo.branch.name.~s~":"autoland"}},
{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
]}}
]}},
{"bool":{"filter":[
{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
{"term":{"repo.branch.name.~s~":"autoland"}},
{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
]}}
]}},
"size":10000,
"sort":[],
"stored_fields":[
"run.taskcluster.id.~s~",
"run.taskcluster.retry_id.~n~",
"job.type.name.~s~",
"run.result.~s~",
"failure.classification.~s~",
"action.duration.~n~"
]
}
Here is the result (minus most of the actual records):
{
"took":5527,
"timed_out":false,
"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
"hits":{
"total":1953,
"max_score":0,
"hits":[{
"_index":"treeherder20200401_000000",
"_type":"th_job",
"_id":"297002129",
"_score":0,
"fields":{
"job.type.name.~s~":["test-linux1804-64-shippable/opt-awsy-base-e10s"],
"run.result.~s~":["success"],
"run.taskcluster.retry_id.~n~":[0],
"failure.classification.~s~":["not classified"],
"run.taskcluster.id.~s~":["Cax7OVybTTuve3IR4nzNZg"],
"action.duration.~n~":[346]
},
"inner_hits":{"failure.notes.~N~":{"hits":{"total":0,"max_score":null,"hits":[]}}}
}
...
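To separate per-shard work from fan-out/coordination overhead, I thought I would aim a similar body at just one of the nine indexes (20 shards instead of 180) and compare the `took` values; a sketch, with the query trimmed down:

```
POST treeherder20200401_000000/_search
{
  "size": 0,
  "query": {"bool": {"filter": [
    {"prefix": {"repo.changeset.id.~s~": "b760586ab7e62af195a44bbaa43b01be047c11db"}},
    {"term": {"repo.branch.name.~s~": "autoland"}}
  ]}}
}
```

If this is roughly nine times faster, the problem is the breadth of the fan-out rather than any one shard.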
For comparison, here is a simple match_all count. It also takes a long time.
{
"_source":false,
"from":0,
"query":{"match_all":{}},
"size":0,
"sort":[]
}
and the result
{
"took":2470,
"timed_out":false,
"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
"hits":{"total":158976878,"max_score":0,"hits":[]}
}
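To find out whether the coordinating node (or any data node) is actually busy during these queries, I was going to try hot threads and the search thread pool stats:

```
GET _nodes/hot_threads?threads=5
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected
```

A non-zero `queue` or `rejected` count on some node would suggest it is saturated even if overall CPU looks idle.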