Increase query performance for many OR clauses


(Hormoz Tarevern) #1

We are hitting a performance issue when there are too many OR clauses. We needed to change the default max_clause_count from 1024 to a larger number. Our query takes about 9 to 12 seconds to return with a large query string. Is there any optimization we should look at? We simply make a match query like this:

"query": {
  "match": {
    "_all": query_string
  }
}

Our query string is a pretty long string, which we analyze, clean, and break into tokens. Currently we have 4 VMs (nodes): 3 of them are data nodes and one is a client node.

I have applied all the recommended settings, like increasing the heap size, enabling mlockall, and raising the open file limit.


(Zachary Tong) #2

You could try running the query through the Profile API, to see if there are any noticeable hotspots. My guess is that each individual term is taking a tiny amount of time, so it's really just death by a thousand cuts.

There's not really much that can be done to optimize these kinds of queries. Every additional OR that's added decreases the exclusivity of the query (more and more documents match) which directly relates to a slowdown, as each additional matching document must be scored.

You could try identifying if some of the queries are really just filtering clauses, and switch those over to filters. You could also try to do some boolean minimization to make sure the queries aren't looking for the same terms (which Lucene doesn't currently optimize out well).
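For example, a sketch of a bool query with the non-scoring pieces moved into the filter context (the field names and values here are purely hypothetical):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "_all": "terms that should affect scoring" } }
      ],
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}
```

Filter clauses are not scored and their results can be cached, so they are typically cheaper than equivalent scoring clauses.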

All in all, the best bet is to try and reduce the size of your queries. Do you really need such long query strings? And so many?

Alternatively, if the answer is "yes", you can simply throw more hardware at it. More shards spread across more nodes will lower latencies.


(Hormoz Tarevern) #3

Thank you for your response :slight_smile:. I have a couple of questions.

1- Since all the clauses are ORs, how does Elasticsearch process them? Does it go through them one by one, or does it build a matrix of all the terms' IDF values and compute cosine similarity against each document?

2- "You could also try to do some boolean minimization to make sure the queries aren't looking for the same terms." Can you please explain more on this? So after the query string gets analyzed and broken down into terms, Elasticsearch doesn't eliminate the duplicate terms?

3- We actually tried increasing the number of nodes (machines), which did not help us at all. Currently we have 3 data nodes and one client node. Each data node has 1 shard. Each node (VM) has 16 vCPUs with 46 GB of RAM. The total number of documents is 366,136 and each shard has about 450 MB of data. Should we increase the number of shards per node? After reading some articles, I found that more shards on the same node (machine) can increase overhead. What is your recommendation?

4- Is it a good idea to break the query down into smaller portions (let's say 4), run them as queries in 4 different threads, and then combine the results? I am guessing Elasticsearch might do that internally.

Thank you


(Zachary Tong) #4

It's a little complicated, but at a high level, think of each query component as representing an iterator. When you call TermQuery.next(), the query's iterator advances to the next matching document (ignoring how the query actually knows which document that is). The next query component is advanced with its own next() method, landing on its next matching document. If the first and second iterators both landed on the same document, you've got a match. If they landed on different docs, you advance the "lagging" iterator up to the most recent document, then keep advancing/leapfrogging the iterators forward until they land on the same doc.

This allows iterators to "jump" over large stretches of documents, because the sparsest iterator will cause the rest of the iterators to skip a bunch of documents (since they can't possibly be a match if the sparsest iterator skipped them already).

So that's how it works when you have two clauses that are both required.
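The leapfrogging above can be sketched in Python. This is a simplified model over plain sorted doc-ID lists (real Lucene iterators advance via skip lists rather than one document at a time):

```python
def intersect(postings):
    """Leapfrog intersection of sorted posting lists (lists of doc IDs).

    Each clause is a position in its posting list; we repeatedly advance
    any lagging list up toward the current maximum doc ID until all
    lists land on the same doc, which is then a match.
    """
    matches = []
    pos = [0] * len(postings)  # current index into each posting list
    while all(p < len(lst) for p, lst in zip(pos, postings)):
        current = [lst[p] for p, lst in zip(pos, postings)]
        target = max(current)
        if all(c == target for c in current):
            matches.append(target)           # every iterator agrees: a match
            pos = [p + 1 for p in pos]       # advance everyone past it
        else:
            # advance each lagging iterator ("leapfrog") toward the target
            pos = [p if lst[p] >= target else p + 1
                   for p, lst in zip(pos, postings)]
    return matches
```

Note how a sparse list drags every other list forward with it, skipping documents that could never match.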

Now imagine if both iterators are OR clauses. You can no longer "skip" over documents due to sparsity in one of the iterators. Instead, you have to collect every document from every iterator, from all the queries. So instead of doing less work, you now do a lot more work. It's still doing less work than a linear scan across all documents, but you pay the cost of having to visit many more documents since many of them match the nonrestrictive OR clauses.
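A sketch of the disjunctive case, under the same simplified posting-list model, makes the cost difference visible: every doc ID from every list must be surfaced, so no list can be skipped:

```python
import heapq

def union(postings):
    """Union of sorted posting lists: every doc from every iterator must
    be visited, so the cost grows with the total number of matching docs
    rather than with the sparsest clause."""
    result = []
    for doc_id in heapq.merge(*postings):    # lazily merge the sorted lists
        if not result or result[-1] != doc_id:
            result.append(doc_id)            # dedupe adjacent repeats
    return result
```
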

2- "You could also try to do some boolean minimization to make sure the queries aren't looking for the same terms." Can you please explain more on this? So after the query string gets analyzed and broken down into terms, Elasticsearch doesn't eliminate the duplicate terms?

Nope. A search for "quick quick quick quick fox" is different from "quick fox", since the duplicated terms will preferentially boost documents that mention "quick" often. Term frequency is an important component of scoring.

What I meant is more related to auto-generated queries. It's not uncommon to see applications spit out autogenerated queries that look like:

((foo AND bar) AND (foo AND baz)) OR (foo AND bar AND baz)

Because each of those components comes from a different part of the application. But really, it can be simplified to:

foo AND bar AND baz

Particularly when used in a filtering, non-scoring situation. It may not apply to your situation since you're using query_string (although you should see if some of your queries are better suited as filters).
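As a toy illustration of that kind of minimization, assuming each clause is just a set of required terms in a non-scoring context (this representation is purely for illustration):

```python
def minimize(clauses):
    """Simplify an OR of AND-clauses, each clause a set of required terms.

    Drops duplicate clauses, and drops any clause whose terms are a
    strict superset of another clause's: in a non-scoring OR, the
    narrower clause already matches every doc the superset clause
    would, so the superset clause is redundant.
    """
    unique = {frozenset(c) for c in clauses}   # dedupe identical clauses
    return [c for c in unique
            if not any(other < c for other in unique)]
```

So `((foo AND bar) AND (foo AND baz)) OR (foo AND bar AND baz)` flattens to two identical clauses `{foo, bar, baz}` and minimizes to one.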

Should we increase the number of shards per node? After reading some articles, I found that more shards on the same node (machine) can increase overhead. What is your recommendation?

Yeah, one shard per machine is the "ideal" situation. So you'd want to add more nodes and more primary shards, e.g. bump the cluster to 5 data nodes and then reindex everything into an index with 5 primary shards. I don't know if (or how much) that'd help, but that's the theory anyway, since there are more nodes/primary shards to process in parallel.

4- Is it a good idea to break the query down into smaller portions (let's say 4), run them as queries in 4 different threads, and then combine the results? I am guessing Elasticsearch might do that internally.

This is tricky, since four separate searches will have four separate sets of scores. Scoring is done in a per-query context, so you can't really compare/combine results from different queries. But if this is simple filtering where you just need the "final set of matching docs", this approach could work. And it could be faster: a search uses one thread per node to prevent thread explosions (e.g. think of multiple concurrent search requests, with hundreds of shards per node), so doing an msearch might give better latency.
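A sketch of what that could look like with the Multi Search API, splitting the terms across sub-requests (the index name and term splits here are hypothetical; the matching doc IDs would then be unioned client-side):

```
GET /my_index/_msearch
{}
{"query": {"match": {"_all": "first half of the terms"}}, "_source": false}
{}
{"query": {"match": {"_all": "second half of the terms"}}, "_source": false}
```

The msearch body is newline-delimited JSON: an (optionally empty) header line, then the search body, repeated per sub-request.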

Related, you might check and see what your nodes are doing during a request. Check the Node Stats API to see if any resource is bottlenecked, perhaps look at the Hot Threads API to see if any of the active threads look suspicious.

The total number of documents is 366,136 and each shard has about 450 MB of data.

Your index is rather small (300k is tiny for three nodes), so there's probably a lot of room for improvement. Could you potentially paste up a full query somewhere?


(system) #5