Search over most frequent matches / terms without TF or IDF adjustment


(Peter) #1

Hi there,

We are working on a text-based search (via the famous "Type your search here" input box) that computes the score over multiple fields and shows the best results. It's basically a bool query with a mixture of "term" and "match" over many different fields (using fuzziness, ngrams, edge-ngrams and others).

We want the best results (the most "popular" ones) to show up first (and thus get the highest score). However, the default TF-IDF algorithm of Lucene gives us the exact opposite. Imagine you look for a vendor that exists in 30% of all index entries. It will have a very low IDF and be ranked very low. We want the exact opposite of that: give us the most frequent first(!).
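To make the effect concrete, here is a minimal sketch of the inverse-document-frequency weight in the form used by Lucene's BM25 similarity. The index size and document frequencies are made-up numbers for illustration, not figures from our index. The weight shrinks as a term appears in more documents, so a vendor found in 30% of all entries contributes far less to the score than a rare one.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF as used by BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index with one million entries (assumption for illustration).
DOC_COUNT = 1_000_000

common_vendor_idf = bm25_idf(doc_freq=300_000, doc_count=DOC_COUNT)  # in 30% of entries
rare_vendor_idf = bm25_idf(doc_freq=100, doc_count=DOC_COUNT)        # in 0.01% of entries

print(f"common vendor IDF: {common_vendor_idf:.3f}")
print(f"rare vendor IDF:   {rare_vendor_idf:.3f}")
```

The common vendor ends up with the smaller weight, which is the behaviour we would like to turn around.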

Trying our luck with the "cross-field" query did not work out, since we want to combine different query types with "bool".

Now, what we "found out" is that using Okapi BM25 with k1=0 and b=0 almost(?) behaves like a similarity that ignores TF/IDF. However, we are unsure whether this really is the way to go.
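As a rough check of that observation, here is a sketch of the term-frequency part of the textbook BM25 formula. The function name and the sample numbers are assumptions for illustration. Some Lucene versions drop the constant (k1 + 1) factor, which should not change the ranking. With k1=0 the factor collapses to 1 for any matching document, so term frequency and field length no longer matter. The IDF weight still multiplies this factor, which may be where the "almost" comes from.

```python
def bm25_tf_factor(tf: float, doc_len: float, avg_doc_len: float,
                   k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25: tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + length_norm)


# With the common defaults, the factor depends on tf and on field length.
default_short = bm25_tf_factor(tf=1, doc_len=10, avg_doc_len=100)
default_long = bm25_tf_factor(tf=50, doc_len=1000, avg_doc_len=100)

# With k1=0 and b=0, every matching document gets the same factor.
flat_short = bm25_tf_factor(tf=1, doc_len=10, avg_doc_len=100, k1=0, b=0)
flat_long = bm25_tf_factor(tf=50, doc_len=1000, avg_doc_len=100, k1=0, b=0)

print(default_short, default_long)
print(flat_short, flat_long)
```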

Can you give us some feedback on that, please?

Is this the way to go, or is there a better solution to our "problem" waiting to be discovered?

Best regards
Peter


(system) #2