Low-level implementation details for query processing

Hi there,

I'm looking to implement both the MinimumShouldMatch and the CommonTermsQuery (in their basic forms) in my own search system purely for benchmarking purposes.

I have sifted through the source code but am unsure about the implementation details of these queries.

I assume the CommonTermsQuery just processes the "rare" terms as a ranked disjunctive query, and then iterates the postings lists for the common terms, visiting and updating the scores only for the docids in the top-k list? I read that if only common terms are supplied, the query is run as a ranked conjunction.

As for the MinimumShouldMatch query, I assume this is implemented as a ranked disjunction, with the added check that at least n terms match before a candidate document is processed?

In addition, does ES use a dynamic pruning strategy for additional efficiency, à la MaxScore or WAND?


This is mostly true except that the rare terms potentially contribute to the score of all documents, not only the top-k.

It could work this way, but the implementation is a bit less naive than that. It partitions clauses into 3 groups: clauses that match the current document, clauses that do not match the current document, and clauses that have not been tested against the current document yet. Then it iterates over the non-verified clauses one by one, ordered by ascending cost (the expected number of matches in Lucene terminology), checks whether they match, and puts them into the right group depending on the result. As soon as the number of non-matching clauses gets larger than ${number of clauses} - ${minimum should match}, it knows the current document has no chance of matching anymore and it moves to the next candidate. This is expected to be faster than just running a ranked disjunction and verifying the number of matches, especially when ${minimum should match} is high.
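The per-document check above can be sketched roughly like this. This is a didactic sketch, not Lucene's actual code: postings lists are stand-in Python sets, and the `in` test stands in for advancing a real postings iterator.

```python
def matches_min_should_match(doc_id, postings_lists, min_should_match):
    """Return True if doc_id appears in at least min_should_match lists.

    Clauses are checked one by one, cheapest (smallest expected number
    of matches, approximated here by list size) first. We stop early as
    soon as either enough clauses have matched, or so many have failed
    that the threshold can no longer be reached.
    """
    unverified = sorted(postings_lists, key=len)  # ascending cost
    max_failures = len(postings_lists) - min_should_match
    matched = failed = 0
    for plist in unverified:
        if doc_id in plist:  # stand-in for advancing the clause's iterator
            matched += 1
            if matched >= min_should_match:
                return True  # enough matches: accept without checking the rest
        else:
            failed += 1
            if failed > max_failures:
                return False  # no chance of reaching the threshold: move on
    return matched >= min_should_match
```

Checking the cheapest clauses first means the early-exit conditions fire after as little work as possible on average.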

No, it does not. There have been discussions about using MaxScore in Lucene; you can read about it at https://issues.apache.org/jira/browse/LUCENE-4100.
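For context, here is a toy illustration of the kind of skipping MaxScore-style pruning enables: once the top-k heap is full, a candidate whose per-term upper bounds cannot add up to beat the current threshold is skipped without being scored. Purely didactic; the data layout and function name are made up for the example, and real implementations skip via the postings iterators rather than scanning all docids.

```python
import heapq

def toy_topk_with_pruning(postings, upper_bounds, k):
    """Top-k disjunction with upper-bound pruning (MaxScore flavour).

    postings:     dict term -> {docid: score contribution}
    upper_bounds: dict term -> maximum contribution of that term anywhere
    """
    heap = []  # min-heap of (score, doc) holding the current top-k
    all_docs = sorted({d for plist in postings.values() for d in plist})
    for doc in all_docs:
        # cheap upper bound on this doc's score from the terms it matches
        bound = sum(upper_bounds[t] for t, p in postings.items() if doc in p)
        if len(heap) == k and bound <= heap[0][0]:
            continue  # cannot enter the top-k: skip full scoring entirely
        score = sum(p.get(doc, 0.0) for p in postings.values())
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)
```

The win comes from the `continue`: documents dominated by low-impact terms are never fully scored.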


Thanks a lot, I really appreciate your response!

So, just to be clear, with CommonTerms, the basic algorithm (for top-k retrieval) is as follows:

  1. Run the "rare" query to retrieve all documents containing one or more of the rare terms
  2. For each document in the set of candidates from step 1, "update" the final score of each doc with the common term contributions
  3. Return the top-k ranked documents from the refined set in step 2.

Thanks again!

In practice, everything happens in a single pass: every time we find a document that matches the rare query, we check whether it also matches the common query and update the score accordingly.
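A minimal sketch of that single-pass idea, with hypothetical postings represented as docid-to-weight dicts (again illustrative, not Lucene's code):

```python
def common_terms_scores(rare_postings, common_postings):
    """rare_postings / common_postings: dict term -> {docid: weight}.

    Only documents matching at least one rare term become candidates;
    common terms never generate candidates on their own, they only add
    to the score of documents the rare terms already matched.
    """
    # candidates come from the rare terms, scored as a disjunction
    candidates = {}
    for plist in rare_postings.values():
        for doc, w in plist.items():
            candidates[doc] = candidates.get(doc, 0.0) + w
    # as each candidate is found, fold in the common-term contributions;
    # a real engine would advance the common iterators alongside the
    # rare ones instead of doing dict lookups
    return {
        doc: rare_score + sum(p.get(doc, 0.0) for p in common_postings.values())
        for doc, rare_score in candidates.items()
    }
```

Note that a document matching only common terms (doc 2 in the test data below) is never visited at all, which is the point of the query.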

Okay, great, that makes more sense. Thanks a lot for your help!
