I believe this is a Lucene thing more than anything. Try this URL:
http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html
This is the core of it, the Lucene Practical Scoring Function:

    score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) )

where
-
tf(t in d) correlates to the term's frequency, defined as the
number of times term t appears in the currently scored document d.
Documents that have more occurrences of a given term receive a higher
score. Note that tf(t in q) is assumed to be 1 and therefore does
not appear in this equation. However, if a query contains the same
term twice, there will be two term-queries with that term, and hence the
computation would still be correct (although not very efficient). The
default computation for tf(t in d) in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#tf(float)>
is:

    tf(t in d) = frequency^(1/2)

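To get a feel for how the square root dampens repeated terms, here is a small Python sketch of that default. The function name `tf` is my own, and this is only the arithmetic from the formula above, not Lucene's code.

```python
import math

def tf(freq: float) -> float:
    """Term-frequency factor from the formula above: sqrt of the raw count."""
    return math.sqrt(freq)

# Quadrupling the occurrences only doubles the factor.
for freq in (1, 4, 16):
    print(freq, tf(freq))
```

So a term that occurs 16 times contributes 4 times as much as a single occurrence, not 16 times.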
-
idf(t) stands for Inverse Document Frequency. This value
correlates to the inverse of docFreq (the number of documents in which
the term t appears). This means rarer terms give higher contribution
to the total score. idf(t) appears for t in both the query and the
document, hence it is squared in the equation. The default computation for
idf(t) in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#idf(int, int)>
is:

    idf(t) = 1 + log( numDocs / (docFreq + 1) )

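As a rough illustration, the idf formula can be sketched in Python like this. I am assuming "log" means the natural logarithm (what Java's Math.log returns); the function and parameter names are mine.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """1 + ln(numDocs / (docFreq + 1)), per the formula above.

    Assumption: natural logarithm, as with Java's Math.log.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A rare term gets a larger factor than a common one in the same index.
rare = idf(doc_freq=1, num_docs=1000)
common = idf(doc_freq=500, num_docs=1000)
print(rare, common)
```

The +1 in the denominator keeps the division safe when docFreq is 0.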
-
coord(q,d) is a score factor based on how many of the query terms
are found in the specified document. Typically, a document that contains
more of the query's terms will receive a higher score than another document
with fewer query terms. This is a search time factor computed in
coord(q,d) <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#coord(int, int)>
by the Similarity in effect at search time.
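The text above does not spell out the default formula for coord. As far as I recall, DefaultSimilarity returns the fraction of query terms that matched (overlap / maxOverlap); treat that as my assumption rather than something stated here. A minimal Python sketch under that assumption:

```python
def coord(overlap: int, max_overlap: int) -> float:
    """Fraction of the query's terms found in the document.

    Assumption: mirrors what I believe DefaultSimilarity does
    (overlap / maxOverlap); not taken from the quoted text.
    """
    return overlap / max_overlap

# A document matching 3 of 4 query terms versus one matching only 1 of 4.
print(coord(3, 4), coord(1, 4))
```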
-
queryNorm(q) is a normalizing factor used to make scores between
queries comparable. This factor does not affect document ranking (since all
ranked documents are multiplied by the same factor), but rather just
attempts to make scores from different queries (or even different indexes)
comparable. This is a search time factor computed by the Similarity in
effect at search time. The default computation in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)>
produces a Euclidean norm <http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm>:

    queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights^(1/2)

The sum of squared weights (of the query terms) is computed by the query
Weight <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Weight.html>
object. For example, a boolean query <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html>
computes this value as:

    sumOfSquaredWeights = q.getBoost()^2 · ∑_{t in q} ( idf(t) · t.getBoost() )^2

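The two queryNorm formulas can be sketched in Python as follows. The helper names and the (idf, boost) pair layout are my own choices for illustration.

```python
import math

def sum_of_squared_weights(query_boost, terms):
    """terms: iterable of (idf, term_boost) pairs, one per query term."""
    return query_boost ** 2 * sum((idf * boost) ** 2 for idf, boost in terms)

def query_norm(ssw):
    """1 / sqrt(sumOfSquaredWeights), per the formula above."""
    return 1.0 / math.sqrt(ssw)

# Two unboosted terms with idf 3.0 and 4.0: 9 + 16 = 25, so the norm is 1/5.
ssw = sum_of_squared_weights(1.0, [(3.0, 1.0), (4.0, 1.0)])
print(ssw, query_norm(ssw))
```

Since every document scored for the same query is multiplied by the same queryNorm, the ordering of the hits does not change.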
-
t.getBoost() is a search time boost of term t in the query q as
specified in the query text (see query syntax <http://lucene.apache.org/java/3_0_2/queryparsersyntax.html#Boosting a Term>),
or as set by application calls to setBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#setBoost(float)>.
Notice that there is really no direct API for accessing a boost of one term
in a multi term query, but rather multi terms are represented in a query as
multi TermQuery <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/TermQuery.html>
objects, and so the boost of a term in the query is accessible by
calling the sub-query getBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#getBoost()>.
-
norm(t,d) encapsulates a few (indexing time) boost and length
factors:
- *Document boost* - set by calling doc.setBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Document.html#setBoost(float)>
before adding the document to the index.
- *Field boost* - set by calling field.setBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)>
before adding the field to a document.
- *lengthNorm(field)*
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)>
- computed when the document is added to the index in
accordance with the number of tokens of this field in the document, so that
shorter fields contribute more to the score. LengthNorm is computed by the
Similarity class in effect at indexing.
When a document is added to the index, all the above factors are
multiplied. If the document has multiple fields with the same name, all
their boosts are multiplied together:

    norm(t,d) = doc.getBoost() · lengthNorm(field) · ∏_{field f in d named as t} f.getBoost()

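Here is a rough Python sketch of that product, before any encoding happens. The quoted text does not give the lengthNorm formula; I am assuming the DefaultSimilarity behaviour I remember, 1/sqrt(number of tokens), and all names below are mine.

```python
import math

def length_norm(num_tokens: int) -> float:
    # Assumption: 1/sqrt(numTokens), which is what I recall
    # DefaultSimilarity using; not stated in the quoted text.
    return 1.0 / math.sqrt(num_tokens)

def norm(doc_boost, field_boosts, num_tokens):
    """doc boost · lengthNorm · product of the boosts of all same-named fields."""
    return doc_boost * length_norm(num_tokens) * math.prod(field_boosts)

# Same boosts, but the 4-token field ends up with a larger norm
# than the 100-token field.
print(norm(1.0, [2.0, 1.5], 4), norm(1.0, [2.0, 1.5], 100))
```

This uses `math.prod`, so it needs Python 3.8 or later.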
However, the resulting norm value is encoded
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)>
as a single byte before being stored. At search time, the norm byte
value is read from the index directory <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/store/Directory.html>
and decoded <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)>
back to a float norm value. This encoding/decoding, while reducing
index size, comes at the price of precision loss - it is not guaranteed
that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
Compression of norm values to a single byte saves memory at search time,
because once a field is referenced at search time, its norms - for all
documents - are maintained in memory.
The rationale supporting such lossy compression of norm values is that
given the difficulty (and inaccuracy) of users to express their true
information need by a query, only big differences matter.
Last, note that search time is too late to modify this norm part of
scoring, e.g. by using a different Similarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html>
for search.
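Putting the pieces together, here is a rough Python sketch of the whole practical scoring function for one document. It is only my illustration of the formula, not how Lucene is implemented. Assumptions: natural log in idf, coord as matched terms divided by query terms, a query-level boost of 1, and a norm passed in as an already decoded float (the single-byte encoding is ignored). All names are made up for the example.

```python
import math

def idf(doc_freq, num_docs):
    # Assumption: natural logarithm.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query, doc_freqs, num_docs, term_freqs, norm):
    """
    query:      {term: boost} for the terms in q
    doc_freqs:  {term: number of documents containing the term}
    term_freqs: {term: occurrences of the term in document d}
    norm:       norm(t,d) for the searched field, already decoded
    """
    idfs = {t: idf(doc_freqs[t], num_docs) for t in query}
    # queryNorm(q), with the query-level boost assumed to be 1.
    ssw = sum((idfs[t] * boost) ** 2 for t, boost in query.items())
    query_norm = 1.0 / math.sqrt(ssw)
    matched = [t for t in query if term_freqs.get(t, 0) > 0]
    # Assumption: coord is the fraction of query terms that matched.
    coord = len(matched) / len(query)
    total = sum(
        math.sqrt(term_freqs[t]) * idfs[t] ** 2 * query[t] * norm
        for t in matched
    )
    return coord * query_norm * total

query = {"apache": 1.0, "lucene": 1.0}
doc_freqs = {"apache": 9, "lucene": 9}
both = score(query, doc_freqs, 10, {"apache": 1, "lucene": 1}, norm=1.0)
one = score(query, doc_freqs, 10, {"apache": 1}, norm=1.0)
print(both, one)
```

With these numbers the document containing both terms scores higher than the one containing a single term, both because it sums two term contributions and because its coord factor is larger.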