Fuzzy queries relevance score detailed explanation


#1

In what step of the relevance scoring phase do fuzzy queries apply the Levenshtein formula?

I am asking this because I read here that the steps for relevance scoring include TF-IDF, vector space model and other features like a coordination factor, field length normalization, and term or query clause boosting.

Where exactly does applying Levenshtein (or Damerau-Levenshtein) occur and, most importantly, where does the fuzziness come from? What is actually fuzzy about fuzzy queries? Is it related to fuzzy logic in any way?

Thanks in advance!


(Mark Harwood) #2

Fuzzy queries take a single user-provided term and expand it into several Lucene TermQuery variants, each of which is boosted with a score that reflects its edit distance from the original term (the boost for a non-fuzzy query term is usually 1.0, i.e. no boosting effect). This boost used to be mixed in with the usual Lucene IDF ranking, but to ill effect [1]. Modern versions of fuzzy query now "lie" about the document frequencies of the auto-expanded term variants to prevent IDF issues like the one linked.

[1] When searching for 'Boss' with fuzziness, get higher score for 'Bose' than 'Boss'. ? How Comes !?!?
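
To make the expansion step concrete, here is a rough Python sketch of the idea, not Lucene's actual code. The function names (levenshtein, expand_fuzzy) are made up for illustration, the term list stands in for the index's term dictionary, and the boost formula (1 minus the edit distance divided by the shorter term length) is an assumption about how the boost might be derived from the distance. It uses plain Levenshtein, without the transposition handling of Damerau-Levenshtein.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_fuzzy(query_term: str,
                 indexed_terms: List[str],
                 max_edits: int = 2) -> List[Tuple[str, int, float]]:
    """Return (term, edit_distance, boost) for every indexed term within
    max_edits of query_term, best boost first.

    The boost formula is an illustrative assumption, not a confirmed
    Lucene implementation detail.
    """
    variants = []
    for term in indexed_terms:
        distance = levenshtein(query_term, term)
        if distance <= max_edits:
            boost = 1.0 - distance / min(len(query_term), len(term))
            variants.append((term, distance, boost))
    # An exact match has distance 0 and therefore the maximum boost of 1.0.
    return sorted(variants, key=lambda v: -v[2])


if __name__ == "__main__":
    for term, distance, boost in expand_fuzzy(
            "boss", ["boss", "bose", "bass", "boat", "apple"]):
        print(term, distance, boost)
```

Each surviving variant would then become its own term query carrying that boost. The sketch deliberately leaves out the document-frequency part: it only shows where the edit distance enters, which is at query expansion time rather than inside the TF-IDF calculation itself.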

