I'm looking to implement both the MinimumShouldMatch and the CommonTermsQuery (in their basic forms) in my own search system purely for benchmarking purposes.
I have sifted through the source code but am unsure on the implementation details of these queries.
I assume the CommonTermsQuery just processes the "rare" terms as a ranked disjunctive query, and then iterates the postings lists for the common terms, but only visiting and updating the scores for the docid's in the top-k list? I read that if only common terms are supplied, the query is ran as a ranked conjunction.
As for the MinimumShouldMatch query, I assume this is implemented as a ranked disjunction with the added check that there are at least n matching terms before processing a candidate document?
In addition, does ES use a dynamic pruning strategy for additional efficiency, ala Maxscore or Wand?
This is mostly true except that the rare terms potentially contribute to the score of all documents, not only the top-k.
It could work this way but the implementation is a bit less naive than that though, as it partitions clauses into 3 groups: clauses that match the current document, clauses that do not match the current document and clauses that have not been tested againts the current document yet. Then it iterates over the non-verified clauses one-by-one, ordered by ascending cost (the expected number of matches in Lucene terminology), checks whether they match, and puts them into the right group depending on the result. As soon as the number of non-matching clauses gets larger than ${number of clauses} - ${minimum should match}, then it knows the current document has no chances to match anymore and it moves to the next candidate. This is expected to be faster than just running a ranked disjunction and verifying the number of matches, especially when ${minimum should match} is high.
In practice, everything happens in a single pass: every time we find a document that matches the rare query, we check whether it also matches the common query and update the score accordingly.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.