Scoring and boost


(Ciryus) #1

Hi,

I'm new to ElasticSearch, and I have been playing with boosting. Here is my
test: https://gist.github.com/1445264.

I am a bit confused by the results. The boosting effects are what I
expect for a single word, but the two-word searches give scores much closer
together than I would expect. In particular, the phrase search gives the same
score in all cases, which I don't understand (I expect the same scoring as in
case 1: keyword, title, body).

Should I increase the boosts substantially, or should I do my mapping and/or
queries differently to achieve this result?

Thanks for any help.


(alichi) #2

I believe this is a Lucene thing more than anything. Try this URL:

http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html

This is the core of it:

$$
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \text{ in } q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
$$

(Lucene Practical Scoring Function)

where

  1. tf(t in d) correlates to the term's frequency, defined as the
    number of times term t appears in the currently scored document d.
    Documents that have more occurrences of a given term receive a higher
    score. Note that tf(t in q) is assumed to be 1 and therefore it does
    not appear in this equation. However, if a query contains the same
    term twice, there will be two term-queries with that same term and hence the
    computation would still be correct (although not very efficient). The
    default computation for tf(t in d) in DefaultSimilarity
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#tf(float))
    is:

    $$ \mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2} $$

  2. idf(t) stands for Inverse Document Frequency. This value
    correlates to the inverse of docFreq (the number of documents in which
    the term t appears). This means rarer terms give higher contribution
    to the total score. idf(t) appears for t in both the query and the
    document, hence it is squared in the equation. The default computation for
    idf(t) in DefaultSimilarity
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#idf(int,%20int))
    is:

    $$ \mathrm{idf}(t) = 1 + \log\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right) $$

  3. coord(q,d) is a score factor based on how many of the query terms
    are found in the specified document. Typically, a document that contains
    more of the query's terms will receive a higher score than another document
    with fewer query terms. This is a search time factor computed in coord(q,d)
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#coord(int,%20int))
    by the Similarity in effect at search time.

  4. queryNorm(q) is a normalizing factor used to make scores between
    queries comparable. This factor does not affect document ranking (since all
    ranked documents are multiplied by the same factor), but rather just
    attempts to make scores from different queries (or even different indexes)
    comparable. This is a search time factor computed by the Similarity in
    effect at search time. The default computation in DefaultSimilarity
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float))
    produces a Euclidean norm
    (http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm):

    $$ \mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}} $$

    The sum of squared weights (of the query terms) is computed by the query
    Weight (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Weight.html)
    object. For example, a boolean query
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html)
    computes this value as:

    $$ \mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \text{ in } q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2 $$

  5. t.getBoost() is a search time boost of term t in the query q as
    specified in the query text (see query syntax,
    http://lucene.apache.org/java/3_0_2/queryparsersyntax.html#Boosting%20a%20Term),
    or as set by application calls to setBoost()
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#setBoost(float)).
    Notice that there is really no direct API for accessing a boost of one term
    in a multi term query, but rather multi terms are represented in a query as
    multi TermQuery
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/TermQuery.html)
    objects, and so the boost of a term in the query is accessible by
    calling the sub-query getBoost()
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#getBoost()).

  6. norm(t,d) encapsulates a few (indexing time) boost and length
    factors:

  - Document boost - set by calling doc.setBoost()
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Document.html#setBoost(float))
    before adding the document to the index.
  - Field boost - set by calling field.setBoost()
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Fieldable.html#setBoost(float))
    before adding the field to a document.
  - lengthNorm(field)
    (http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int))
    - computed when the document is added to the index in
    accordance with the number of tokens of this field in the document, so that
    shorter fields contribute more to the score. LengthNorm is computed by the
    Similarity class in effect at indexing.

When a document is added to the index, all the above factors are
multiplied. If the document has multiple fields with the same name, all
their boosts are multiplied together:

$$ \mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}() $$

However, the resulting norm value is encoded
(http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float))
as a single byte before being stored. At search time, the norm byte
value is read from the index directory
(http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/store/Directory.html)
and decoded
(http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte))
back to a float norm value. This encoding/decoding, while reducing
index size, comes with the price of precision loss - it is not guaranteed
that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

Compression of norm values to a single byte saves memory at search time,
because once a field is referenced at search time, its norms - for all
documents - are maintained in memory.

The rationale supporting such lossy compression of norm values is that
given the difficulty (and inaccuracy) of users to express their true
information need by a query, only big differences matter.

Last, note that search time is too late to modify this norm part of
scoring, e.g. by using a different Similarity
(http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html)
for search.
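
To make the formula concrete, here is a minimal Python sketch of the scoring function quoted above, for a query made of plain term clauses on one field. The function names and the toy numbers are made up for illustration, the query-level boost is assumed to be 1, and the norm is passed in as a plain float (no single-byte quantization). It follows the javadoc formulas and is not Lucene's actual code.

```python
import math

def tf(freq):
    # DefaultSimilarity: square root of the term frequency in the document
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # DefaultSimilarity: 1 + log(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(term_boosts, doc_term_freqs, doc_freqs, num_docs, norm=1.0):
    """term_boosts: term -> query-time boost; doc_term_freqs: term -> freq in doc;
    doc_freqs: term -> number of documents containing the term."""
    # sumOfSquaredWeights with q.getBoost() assumed to be 1
    sum_sq = sum((idf(doc_freqs[t], num_docs) * b) ** 2
                 for t, b in term_boosts.items())
    query_norm = 1.0 / math.sqrt(sum_sq)
    matched = [t for t in term_boosts if doc_term_freqs.get(t, 0) > 0]
    # coord: fraction of query terms found in the document
    coord = len(matched) / len(term_boosts)
    total = sum(tf(doc_term_freqs[t]) * idf(doc_freqs[t], num_docs) ** 2
                * term_boosts[t] * norm
                for t in matched)
    return coord * query_norm * total

if __name__ == "__main__":
    doc = {"boost": 4, "score": 1}
    dfs = {"boost": 1, "score": 3}
    print(score({"boost": 1.0}, doc, dfs, num_docs=10))
    print(score({"boost": 5.0}, doc, dfs, num_docs=10))
    print(score({"boost": 5.0, "score": 1.0}, doc, dfs, num_docs=10))
```

In this sketch a boost on a single-term query is cancelled by queryNorm, so the first two printed scores are equal; a boost only shifts the relative weight of clauses within one query, as in the third call.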


(Stefan Nguyen) #3

Great answer!

Would you pls help with my situation?
I have documents with a 'sentence' field of type String. How can I boost
the score of documents with shorter 'sentence' values?

On Thursday, December 8, 2011 4:01:57 PM UTC+7, Ali Loghmani wrote:

I believe this is a lucene thing more than anything. [...]


(Ivan Brusic) #4

Stefan,

By the characteristics of the TF-IDF formula, shorter sentences should be
boosted. Lucene (and therefore ElasticSearch) uses norms and they are
enabled by default.
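
As a rough illustration of why shorter fields score higher: the default length norm in Lucene's DefaultSimilarity is 1/sqrt(numTerms). The sketch below uses a made-up function name and ignores index-time boosts and the single-byte quantization of the stored norm.

```python
import math

def length_norm(num_terms):
    # Default length norm: shrinks as the field gets longer
    return 1.0 / math.sqrt(num_terms)

if __name__ == "__main__":
    for n in (1, 4, 16, 100):
        print(n, length_norm(n))
```

All else being equal, a 4-token 'sentence' gets twice the norm of a 16-token one, so a match in it contributes twice as much to the score.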

Here is a good explanation of norms and term frequencies in Lucene:

http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e58

Cheers,

Ivan

On Sat, Jul 14, 2012 at 6:18 AM, Stefan Nguyen stnguyenvn@gmail.com wrote:

Great answer!

Would you pls help with my situation?
I have documents with a 'sentence' field of type String. How can I boost
the score of documents with shorter 'sentence' values?

On Thursday, December 8, 2011 4:01:57 PM UTC+7, Ali Loghmani wrote:

I believe this is a lucene thing more than anything. [...]


(Timothy John) #5

Can you cite some more examples of numbers and their equivalents when encoded to and decoded from a single byte, please? Thanks.


(Ivan Brusic) #6

Wow, this is an old topic. I have never seen examples of the actual numbers
of the encoded norm, except for examples showing the lossy nature of the
encode-decode-encode process.

I am assuming you are talking about the length norm and not the overall
norm value which includes boosting. Do not use index-time boosting, use
query-time boosting and/or function scores.
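
A hypothetical sketch of what query-time boosting can look like, written as a Python dict in Elasticsearch query DSL form. The field names are the ones from the first post; the boost values and the query text are arbitrary placeholders, not a recommendation.

```python
import json

# Hypothetical request body: per-field boosts are given with the
# field^boost syntax of the multi_match query.
query_body = {
    "query": {
        "multi_match": {
            "query": "some words",
            "fields": ["keyword^4", "title^2", "body"],
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(query_body, indent=2))
```

Because the boosts live in the query, they can be changed without reindexing, which is not the case for index-time boosts folded into the norm.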

You might have better luck on the Lucene list since this is more of an
internal Lucene question. Elasticsearch users tend to be more interested in
scaling and how to make Kibana look better than in pure search. The code
for encoding norms is in ClassicSimilarity.

There is a static lookup table, but I have not executed the code to see
what those values actually are.
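
For anyone who wants to see concrete numbers, below is a Python sketch of the byte encoding, modelled on Lucene's SmallFloat.floatToByte315 / byte315ToFloat as best recalled; treat the bit layout as an assumption and check it against the Lucene source for your version. The function names are made up. Note that this sketch round-trips 0.89 to 0.875, not the 0.75 quoted from the 3.0.2 javadoc earlier in the thread, so one of the two may not match a given release.

```python
import struct

ZERO_POINT = (63 - 15) << 3  # exponent offset used by the "315" encoding

def float_to_bits(f):
    # Reinterpret a float32 as a signed 32-bit integer
    return struct.unpack(">i", struct.pack(">f", f))[0]

def bits_to_float(bits):
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def encode_norm(f):
    bits = float_to_bits(f)
    small = bits >> 21  # keep the sign, the exponent and the top mantissa bits
    if small <= ZERO_POINT:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> 1
    if small >= ZERO_POINT + 0x100:
        return 255                    # overflow -> largest value
    return small - ZERO_POINT

def decode_norm(b):
    if b == 0:
        return 0.0
    return bits_to_float(((b & 0xFF) << 21) + ((63 - 15) << 24))

# The static lookup table: decoded value for every possible norm byte
NORM_TABLE = [decode_norm(b) for b in range(256)]

if __name__ == "__main__":
    for x in (1.0, 0.89, 0.7, 0.5, 0.3, 0.1):
        b = encode_norm(x)
        print(x, "->", b, "->", NORM_TABLE[b])
```

Printing the table shows how coarse the steps are: nearby values collapse onto the same byte, which is why only big differences in field length show up in the score.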

I should point out that Lucene 6, and therefore Elasticsearch 5.0, will
have the BM25 similarity enabled by default. BM25 provides a more
tunable/adaptive approach to field length normalization. Before going down
the route of attempting to alter the length norm yourself, it might be
worthwhile to check out BM25.
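
To give an idea of what "tunable" means here, this is a minimal sketch of the textbook BM25 term-frequency factor. The defaults k1 = 1.2 and b = 0.75 match Lucene's BM25Similarity defaults; the function name is made up, and the idf part and any index-time quantization of field length are left out.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # b controls how strongly field length is normalized:
    # b = 0 ignores length entirely, b = 1 normalizes fully.
    length_factor = 1.0 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)

if __name__ == "__main__":
    for doc_len in (5, 10, 40):
        print(doc_len, bm25_tf(1, doc_len, avg_doc_len=10))
    # With b = 0 the length no longer matters
    print(bm25_tf(1, 5, 10, b=0.0), bm25_tf(1, 40, 10, b=0.0))
```

Unlike the fixed 1/sqrt(numTerms) length norm, the strength of the length effect is a parameter here, and it is relative to the average field length in the index.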

Cheers,

Ivan

