It's a little complicated, but at a high level, think of each query component as representing an iterator. When you call TermQuery.next(), the query's iterator advances to the next matching document (ignoring how the query actually knows the next matching document). The second query component is then advanced with its own next() method, landing on its next matching document. If the first and second iterators both landed on the same document, you've got a match. If they landed on different docs, you advance the "lagging" iterator up to the most recent document, then keep advancing/leapfrogging the iterators forward until they hit the same doc.
This allows iterators to "jump" over large stretches of documents, because the sparsest iterator will cause the rest of the iterators to skip a bunch of documents (since they can't possibly be a match if the sparsest iterator skipped them already).
So that's how it works when you have two clauses that are both required.
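In rough pseudo-Python, the leapfrog looks something like the sketch below. The iterator interface (next_doc(), advance(target), doc_id()) is an assumption loosely modeled on Lucene's DocIdSetIterator, so treat it as illustrative rather than how the real code reads:

```python
NO_MORE_DOCS = float("inf")  # sentinel for an exhausted iterator

def leapfrog_and(iterators):
    """Yield doc IDs matched by every iterator (a conjunction of required clauses).

    Assumed interface per iterator:
      - next_doc():      advance to the next matching doc ID and return it
      - advance(target): skip forward to the first matching doc ID >= target
      - doc_id():        the doc ID the iterator currently sits on
    """
    # Put every iterator on its first matching document.
    for it in iterators:
        it.next_doc()

    while True:
        # The "lead" is whichever iterator sits on the highest doc ID.
        target = max(it.doc_id() for it in iterators)
        if target == NO_MORE_DOCS:
            return  # some clause is exhausted, so no more matches are possible

        # Leapfrog every lagging iterator forward to (or past) the target.
        for it in iterators:
            if it.doc_id() < target:
                it.advance(target)

        # If everyone landed on the same doc, it's a match.
        if all(it.doc_id() == target for it in iterators):
            yield target
            # Move one iterator ahead and keep leapfrogging.
            iterators[0].next_doc()
```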
Now imagine both iterators are OR clauses. You can no longer "skip" over documents due to sparsity in one of the iterators. Instead, you have to collect every document from every iterator, across all the clauses. So instead of doing less work, you now do a lot more. It's still less work than a linear scan across all documents, but you pay the cost of visiting many more documents, since many of them match the non-restrictive OR clauses.
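For contrast, here's the same kind of sketch for a pure disjunction (same assumed iterator interface, again just illustrative): nothing can be skipped, so every posting from every clause gets visited at least once.

```python
import heapq

NO_MORE_DOCS = float("inf")  # sentinel for an exhausted iterator

def union_or(iterators):
    """Yield every doc ID matched by at least one iterator (a disjunction)."""
    # Seed a min-heap with each iterator's first matching document.
    heap = []
    for it in iterators:
        doc = it.next_doc()
        if doc != NO_MORE_DOCS:
            heapq.heappush(heap, (doc, id(it), it))

    while heap:
        doc, _, it = heapq.heappop(heap)
        # Emit each doc ID once, even if several clauses match it.
        if not heap or heap[0][0] != doc:
            yield doc
        nxt = it.next_doc()
        if nxt != NO_MORE_DOCS:
            heapq.heappush(heap, (nxt, id(it), it))
```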
2- "You could also try to do some boolean minimization to make sure the queries aren't looking for the same terms. Can you please explain more on this?" So after the query string gets analyze and breaks down to terms, elasticsearch doesn't eliminate the duplicate terms?
Nope. A search for"quick quick quick quick fox"
is different from "quick fox"
, since the duplicated terms will preferentially boost documents that mention "quick" often. Term frequency is an important component of scoring.
What I meant is more related to auto-generated queries. It's not uncommon to see applications spit out queries that look like:
((foo AND bar) AND (foo AND baz)) OR (foo AND bar AND baz)
Because each of those components comes from a different part of the application. But really, it can be simplified to:
foo AND bar AND baz
Particularly when used in a filtering, non-scoring situation. It may not apply to your situation since you're using query_string (although you should see if some of your queries are better suited as filters).
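As an illustration, the minimized version could be run in filter context instead of through query_string. The field name and terms below are placeholders, not anything from your mapping:

```python
# Placeholder field name ("body") and terms; filter context skips
# scoring entirely and lets Elasticsearch cache the clauses.
simplified_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"body": "foo"}},
                {"term": {"body": "bar"}},
                {"term": {"body": "baz"}},
            ]
        }
    }
}
```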
3- Should we increase the number of shards per node? After reading some articles, I found that more shards on the same node (machine) can increase the overhead. What is your recommendation?
Yeah, one shard per machine is the "ideal" situation. So you'd want to add more nodes and more primary shards, e.g. bump the cluster to 5 data nodes and then reindex everything into an index with 5 primary shards. I don't know if (or how much) that'd help, but that's the theory anyway, since there are more nodes/primary shards to process in parallel.
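If you went that route, it could look roughly like this (a sketch assuming a 7.x-style elasticsearch-py client; the endpoint, index names, and shard count are just examples):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Create the new index with 5 primary shards (roughly one per data node).
es.indices.create(
    index="my_index_v2",  # hypothetical target index
    body={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
)

# Copy everything over with the _reindex API, then repoint the
# application (or an alias) at the new index once it finishes.
es.reindex(
    body={
        "source": {"index": "my_index"},   # hypothetical source index
        "dest": {"index": "my_index_v2"},
    },
    wait_for_completion=True,
)
```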
4- Is it a good idea to break the query down into smaller portions (let's say 4), run them in 4 different threads, and then combine the results? I am guessing Elasticsearch might do that internally.
This is tricky, since four separate searches will have four separate sets of scores. Scoring is done in a per-query context, so you can't really compare/combine results from different queries. But if this is simple filtering where you just need the "final set of matching docs", this approach could work. And it would be faster... a search uses one thread per node to prevent thread explosions (e.g. think multiple concurrent search requests, with hundreds of shards per node). So doing an msearch might give better latency.
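If you do go the filter-only route, the split-and-combine idea could look roughly like this (again a sketch with a 7.x-style elasticsearch-py client; the index name and the four sub-queries are placeholders, and a real job would scroll or use search_after rather than capping each sub-query at 10k hits):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Four hypothetical filter-only sub-queries.
sub_queries = [
    {"bool": {"filter": [{"term": {"body": "foo"}}]}},
    {"bool": {"filter": [{"term": {"body": "bar"}}]}},
    {"bool": {"filter": [{"term": {"body": "baz"}}]}},
    {"bool": {"filter": [{"term": {"body": "qux"}}]}},
]

# msearch takes alternating header/body lines; all four searches go out
# in one request and run concurrently on the cluster side.
body = []
for q in sub_queries:
    body.append({"index": "my_index"})  # hypothetical index name
    body.append({"query": q, "_source": False, "size": 10000})

responses = es.msearch(body=body)

# Combine the hits from each response into one final set of matching IDs.
matching_ids = set()
for resp in responses["responses"]:
    for hit in resp["hits"]["hits"]:
        matching_ids.add(hit["_id"])
```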
Related, you might check and see what your nodes are doing during a request. Check the Node Stats API to see if any resource is bottlenecked, and perhaps look at the Hot Threads API to see if any of the active threads look suspicious.
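Both are easy to pull while a slow search is in flight, e.g. (same 7.x-style client assumption; which metrics matter most depends on your setup):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Node Stats: watch CPU, heap, and the search thread pool's queue/rejections.
stats = es.nodes.stats(metric="os,jvm,thread_pool")
for node in stats["nodes"].values():
    print(
        node["name"],
        "cpu%:", node["os"]["cpu"]["percent"],
        "heap%:", node["jvm"]["mem"]["heap_used_percent"],
        "search queue:", node["thread_pool"]["search"]["queue"],
        "search rejected:", node["thread_pool"]["search"]["rejected"],
    )

# Hot Threads: a plain-text dump of what the busiest threads are doing.
print(es.nodes.hot_threads(threads=5))
```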
The total number of documents is 366136, and each shard has about 450 MB of data.
Your index is rather small (300k is tiny for three nodes), so there's probably a lot of room for improvement. Could you potentially paste up a full query somewhere?