I am wondering if there is a solution to a slow-query problem I am having.
- Is it a single slow node? How do I find out?
- Is the coordinating (query) node the bottleneck? It does not look busy, but how can I confirm that?
- Are the shards too small? Would half as many shards (each twice as large) go twice as fast?
- Are there too many nodes? Would half the nodes (each holding twice the shards) reduce the maximum latency?
- Or is the dataset just big?
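To check the single-slow-node theory, I was thinking the profile API might reveal an outlier shard; a sketch of what I might run (trimmed query body, `treeherder*` as a stand-in for my nine indexes):

```
POST treeherder*/_search
{
  "profile": true,
  "size": 0,
  "query": {"bool": {"filter": [
    {"prefix": {"repo.changeset.id.~s~": "b760586ab7e62af195a44bbaa43b01be047c11db"}},
    {"term": {"repo.branch.name.~s~": "autoland"}}
  ]}}
}
```

If I read the docs correctly, the response includes a per-shard timing breakdown, so a consistently slow shard should point at its node.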
This is a 40-node cluster with a number of indexes. The shards for each index are spread evenly over the nodes so that each node picks up some of the query effort. Each shard is targeted at about 20 GB so it can be moved/recovered within a reasonable amount of time.
This particular query hits 9 indexes, each covering 3 months of data, for a total of 180 shards (20 per index).
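To confirm the shards really are spread evenly, I can list them per node (again assuming a `treeherder*` index pattern):

```
GET _cat/shards/treeherder*?v&h=index,shard,prirep,node,store&s=node
```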
Thank you
Here is the slow query
{
"_source":false,
"from":0,
"query":{"bool":{"should":[
{"bool":{"filter":[
{"nested":{
"inner_hits":{
"_source":false,
"size":100000,
"stored_fields":["failure.notes.~N~.text.~s~"]
},
"path":"failure.notes.~N~",
"query":{"match_all":{}}
}},
{"bool":{"filter":[
{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
{"term":{"repo.branch.name.~s~":"autoland"}},
{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
]}}
]}},
{"bool":{"filter":[
{"prefix":{"repo.changeset.id.~s~":"b760586ab7e62af195a44bbaa43b01be047c11db"}},
{"term":{"repo.branch.name.~s~":"autoland"}},
{"bool":{"must_not":{"term":{"run.tier.~n~":3}}}},
{"bool":{"must_not":{"term":{"run.result.~s~":"retry"}}}},
{"bool":{"must_not":{"term":{"job.type.name.~s~":"Gecko Decision Task"}}}},
{"bool":{"must_not":{"prefix":{"job.type.name.~s~":"Action"}}}}
]}}
]}},
"size":10000,
"sort":[],
"stored_fields":[
"run.taskcluster.id.~s~",
"run.taskcluster.retry_id.~n~",
"job.type.name.~s~",
"run.result.~s~",
"failure.classification.~s~",
"action.duration.~n~"
]
}
Here is the result (minus most of the actual records):
{
"took":5527,
"timed_out":false,
"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
"hits":{
"total":1953,
"max_score":0,
"hits":[{
"_index":"treeherder20200401_000000",
"_type":"th_job",
"_id":"297002129",
"_score":0,
"fields":{
"job.type.name.~s~":["test-linux1804-64-shippable/opt-awsy-base-e10s"],
"run.result.~s~":["success"],
"run.taskcluster.retry_id.~n~":[0],
"failure.classification.~s~":["not classified"],
"run.taskcluster.id.~s~":["Cax7OVybTTuve3IR4nzNZg"],
"action.duration.~n~":[346]
},
"inner_hits":{"failure.notes.~N~":{"hits":{"total":0,"max_score":null,"hits":[]}}}
}
...
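To separate per-shard work from fan-out/coordination overhead, I thought I would aim a similar body at just one of the nine indexes (20 shards instead of 180) and compare the `took` values; a sketch, with the query trimmed down:

```
POST treeherder20200401_000000/_search
{
  "size": 0,
  "query": {"bool": {"filter": [
    {"prefix": {"repo.changeset.id.~s~": "b760586ab7e62af195a44bbaa43b01be047c11db"}},
    {"term": {"repo.branch.name.~s~": "autoland"}}
  ]}}
}
```

If this is roughly nine times faster, the problem is the breadth of the fan-out rather than any one shard.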
For comparison, here is a simple match_all count. It also takes a long time.
{
"_source":false,
"from":0,
"query":{"match_all":{}},
"size":0,
"sort":[]
}
and the result
{
"took":2470,
"timed_out":false,
"_shards":{"total":180,"successful":180,"skipped":0,"failed":0},
"hits":{"total":158976878,"max_score":0,"hits":[]}
}
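To find out whether the coordinating node (or any data node) is actually busy during these queries, I was going to try hot threads and the search thread pool stats:

```
GET _nodes/hot_threads?threads=5
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected
```

A non-zero `queue` or `rejected` count on some node would suggest it is saturated even if overall CPU looks idle.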