What is the order of evaluation of the TermsFilter (or the newer TermsQuery) when multiple TermsFilter are chained together?


(Swaminathan Rajamohan) #1

What is the order of evaluation of the TermsFilter (or the newer TermsQuery) when multiple TermsFilter are chained together?

for eg.     should { termsquery1, termsquery2}

And what is the order of evaluation in the event that the above TermsFilter(TermsQuery) is nested within another Filter?

for eg.    must{ termsquery2, should{ termsquery1,termsquery2}}

I would like to know if the order in which the terms are specified will matter with regards to query performance?

Specifically having termsquery2 as a filter, will it speed up evaluation of termsquery1 & termsquery2 owing to the fact termsquery1 & termsquery2 will operate on a smaller subset of the corpus.

In my use case.

Termsquery1 -> Large set of terms
Termsquery2 -> smaller set of terms 
and TermsQuery1 >> TermsQuery2 (greatly exceeeds)

(Adrien Grand) #2

The way queries work is by creating a Scorer that iterates in doc ID order over matching documents. A regular disjunction (the one you could create with a bool query) uses a heap in order to merge several iterators into a new iterator on the fly. However when there are many terms, the fact that the priority queue needs to be rebalanced on each iteration can make things very slow. This is why TermsQuery uses a different approach: instead of creating an iterator that will merge the sub iterators on the fly, it creates a bit set and then consumes all iterators and sets bits in the bit set for every visited document. Then it returns an iterator over the bit set.

This approach helps when there are many terms but it has a major downside: it needs to consume all doc IDs from all sub iterators all the time. This means that no matter if you intersect a terms query with a selective query or not, it will always do the same amount of work. Lucene protects against this problem by rewriting TermQuery instances to a ConstantScoreQuery around BooleanQuery when there are less than 16 terms (conservative heuristic), but when you have many terms there is not much you can do about it.

So to answer your question, the order does not matter when chaining, in all cases all terms queries will be evaluated against all documents from the index.


(Swaminathan Rajamohan) #3

Thanks for the explanation @jpountz. That helped a lot. I was suspecting that my use case will fall under a BitSet approach and not give any advantage via the filtering because of the number of terms involved.


(system) #4