Slow Query, where to start? (v6.5.4)

I am wondering if there is a solution to a slow-query problem I am having.

  • Could it be a single slow node? How do I find out? (a quick per-node check is sketched right after this list)
  • Maybe the coordinating (query) node is the bottleneck? It does not look busy, but how do I confirm that?
  • Are the shards too small? Would half as many (larger) shards go twice as fast?
  • Are there too many nodes? Would half the nodes (each holding twice as many shards) reduce the maximum latency?
  • Or is the data simply "just big"?
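
For the "single slow node" question, a minimal first check, assuming access to the cluster's REST API, is the cat nodes endpoint, which reports heap, CPU and load per node (column names as in the 6.x docs):

    GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,load_5m&s=cpu:desc

Running that a few times while the slow query executes should make a single hot node stand out.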

This is a 40-node cluster with a number of indexes. The shards of each index are spread evenly over the nodes so that every node picks up some of the query effort. Each shard is targeted to be about 20 GB in size so it can be moved/recovered within a reasonable amount of time.

This particular query hits 9 indexes, each covering 3 months of data, for a total of 180 shards.
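
To double-check that layout, a quick sketch using the cat shards endpoint (the treeherder* index pattern is assumed from the index names in the results below) lists each shard's size and the node it lives on:

    GET _cat/shards/treeherder*?v&h=index,shard,prirep,node,store&s=store:desc

If a few shards are far larger than the 20 GB target, or several shards of one index share a node, it would show up here.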

Thank you

Here is the slow query

{
	"_source":false,
	"from":0,
	"query":{"bool":{"should":[
		{"bool":{"filter":[
			{"nested":{
				"inner_hits":{
					"_source":false,
					"size":100000,
					"stored_fields":["failure.notes.~N~.text.~s~"]
				},
				"path":"failure.notes.~N~",
				"query":{"match_all":{}}
			}},
			{"bool":{"filter":[
				{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
				{"term":{"repo.branch.name.~s~":"autoland"}},
				{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
				{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
				{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
				{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
			]}}
		]}},
		{"bool":{"filter":[
			{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
			{"term":{"repo.branch.name.~s~":"autoland"}},
			{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
			{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
			{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
			{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
		]}}
	]}},
	"size":10000,
	"sort":[],
	"stored_fields":[
		"run.taskcluster.id.~s~",
		"run.taskcluster.retry_id.~n~",
		"job.type.name.~s~",
		"run.result.~s~",
		"failure.classification.~s~",
		"action.duration.~n~"
	]
}

Here is the result (minus most of the actual records):

{
	"took":5527,
	"timed_out":false,
	"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
	"hits":{
		"total":1953,
		"max_score":0,
		"hits":[{
			"_index":"treeherder20200401_000000",
			"_type":"th_job",
			"_id":"297002129",
			"_score":0,
			"fields":{
				"job.type.name.~s~":["test-linux1804-64-shippable/opt-awsy-base-e10s"],
				"run.result.~s~":["success"],
				"run.taskcluster.retry_id.~n~":[0],
				"failure.classification.~s~":["not classified"],
				"run.taskcluster.id.~s~":["Cax7OVybTTuve3IR4nzNZg"],
				"action.duration.~n~":[346]
			},
			"inner_hits":{"failure.notes.~N~":{"hits":{"total":0,"max_score":null,"hits":[]}}}
		}
    ...

For comparison, here is a simple count. It is also slow.

{
	"_source":false,
	"from":0,
	"query":{"match_all":{}},
	"size":0,
	"sort":[]
}

and the result

{
	"took":2470,
	"timed_out":false,
	"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
	"hits":{"total":158976878,"max_score":0,"hits":[]}
}

How much data does each node hold? What type of hardware are you using?

Have you tried to identify what is causing the slowness? Is CPU being maxed out on some node? Is disk I/O or iowait high on any of the nodes while the query is running?
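
One way to watch this from Elasticsearch itself, rather than logging into every host, is something along these lines with the nodes stats API (the os section reports CPU percent and load average, and fs includes I/O stats on Linux):

    GET _nodes/stats/os,fs

Comparing two snapshots taken while the query runs, or running iostat -x on the suspect hosts, should show whether any node is CPU or I/O bound.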

Thank you, Christian_Dahlqvist.

I am not seeing high CPU on any node. I have not looked at disk I/O yet, but I do notice a big difference between the cold run of the query and the warm run shown here, so I assume memory (caching) is doing most of the work. I can run the same query repeatedly and get approximately the same timing each time (4 to 10 seconds).

My question is: what tool or API do I use to see Elasticsearch's own accounting of the query run? Is there an endpoint that provides timing for the various steps taken to satisfy a query? For example, I have an application that reports the timing of each step it takes, so I know which subsystem may be causing the latency. Does ES have something that measures the steps of a query response, even if only the response time of each node?

Even if that does not exist, what is the best tool for getting hints about why a query is slow? There is cluster monitoring, but I think it is too coarse-grained to understand individual queries.

Example timing steps

    "timing":{
        "es":4.502410888671875,
        "find_table":0,
        "preamble":0,
        "save":0,
        "total":0,
        "jsonification":0.1,
        "translate":10
    }

You could try the profile API.
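
For reference, profiling is enabled by adding "profile": true to the search body itself; a trimmed sketch, with a simplified query standing in for the full bool query above:

    {
        "profile": true,
        "size": 0,
        "query": {
            "term": {"repo.branch.name.~s~": "autoland"}
        }
    }

As far as I know, in 6.x the profile output covers the query phase (and aggregations) per shard; the fetch phase, where the large inner_hits would actually be loaded, is not broken out.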

Very good! I can now see the response time for each [node][index][shard]. They are all speedy, except one:

"time_in_nanos": 1,921,544,782   # commas added for humanity

Now I can drill down into this sick node.
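
A rough sketch of two follow-up checks on that node (the node name is a placeholder): the hot threads API shows what the node is busy doing while the query runs, and the cat thread pool view shows whether its search queue is backing up relative to the other nodes.

    GET _nodes/<node_name>/hot_threads
    GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected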

Thank you Christian_Dahlqvist! I hope you have a good weekend!
