Scoring and boost

Ciryus · December 7, 2011, 11:43pm

Hi,

I'm new to ElasticSearch, and I have been playing with boosting. Here is my
test: https://gist.github.com/1445264.

I am a bit confused by the results. The boosting effects are those that I
expect for a single word, but the two-word searches give scores much closer
than I would expect. In particular, the phrase search gives the same score
in all cases, which I don't understand (I expect the same scoring as in
case 1: keyword, title, body).

Should I increase the boosting substantially, or should I do my mapping and/or
queries differently to achieve this result?

Thanks for any help.

alichi · December 8, 2011, 9:01am

I believe this is a Lucene thing more than anything. Try this URL:

http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html

This is the core of it:

$$
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
$$

(Lucene Practical Scoring Function)

where

tf(t in d) correlates to the term's frequency, defined as the
number of times term t appears in the currently scored document d.
Documents that have more occurrences of a given term receive a higher
score. Note that tf(t in q) is assumed to be 1 and therefore it does
not appear in this equation. However, if a query contains the same term
twice, there will be two term-queries with that same term and hence the
computation would still be correct (although not very efficient). The
default computation for tf(t in d) in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#tf(float)>
is:

$$
\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}
$$

idf(t) stands for Inverse Document Frequency. This value
correlates to the inverse of docFreq (the number of documents in which
the term t appears). This means rarer terms give higher contribution
to the total score. idf(t) appears for t in both the query and the
document, hence it is squared in the equation. The default computation for
idf(t) in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#idf(int, int)>
is:

$$
\mathrm{idf}(t) = 1 + \log\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right)
$$
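
As a rough illustration of the two formulas above, here is a minimal Python sketch. It assumes the log is the natural logarithm (which is how I understand DefaultSimilarity to compute it); the index size and document frequencies are made-up numbers, not taken from the gist.

```python
import math


def tf(frequency: float) -> float:
    """Default tf: square root of the term's frequency in the document."""
    return math.sqrt(frequency)


def idf(doc_freq: int, num_docs: int) -> float:
    """Default idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index of 1000 documents.
common = idf(doc_freq=500, num_docs=1000)  # term found in half the documents
rare = idf(doc_freq=4, num_docs=1000)      # term found in only four documents

print(f"tf(1) = {tf(1):.2f}, tf(4) = {tf(4):.2f}")
print(f"idf(common) = {common:.3f}, idf(rare) = {rare:.3f}")
```

Because tf grows with the square root, quadrupling the occurrences of a term only doubles its tf, while a rare term gets a noticeably larger idf than a common one.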

coord(q,d) is a score factor based on how many of the query terms
are found in the specified document. Typically, a document that contains
more of the query's terms will receive a higher score than another document
with fewer query terms. This is a search time factor computed in
coord(q,d) <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#coord(int, int)>
by the Similarity in effect at search time.

queryNorm(q) is a normalizing factor used to make scores between
queries comparable. This factor does not affect document ranking (since all
ranked documents are multiplied by the same factor), but rather just
attempts to make scores from different queries (or even different indexes)
comparable. This is a search time factor computed by the Similarity in
effect at search time. The default computation in DefaultSimilarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)>
produces a Euclidean norm <http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm>:

$$
\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}
$$

The sum of squared weights (of the query terms) is computed by the query
Weight <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Weight.html>
object. For example, a boolean query <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html>
computes this value as:

$$
\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2
$$
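
One consequence of this normalization that may matter for boosting experiments: if every term in a query gets the same boost, queryNorm shrinks by the same factor, so the boost cancels out. A small Python sketch of that arithmetic, with invented idf values and helper names of my own (these are not Lucene API names):

```python
import math


def sum_of_squared_weights(query_boost, terms):
    """terms: list of (idf, term_boost) pairs, one per query term."""
    return query_boost ** 2 * sum((idf * boost) ** 2 for idf, boost in terms)


def query_norm(ssw):
    """Default queryNorm: 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(ssw)


# Hypothetical two-term query with idf values 2.0 and 3.0.
plain = [(2.0, 1.0), (3.0, 1.0)]      # no term boosts
boosted = [(2.0, 10.0), (3.0, 10.0)]  # every term boosted 10x

qn_plain = query_norm(sum_of_squared_weights(1.0, plain))
qn_boosted = query_norm(sum_of_squared_weights(1.0, boosted))

# Effective per-term weight is boost * queryNorm; the uniform boost cancels.
print(1.0 * qn_plain, 10.0 * qn_boosted)
```

Only the ratio between boosts of different terms or clauses survives normalization, which is one reason scores can end up closer together than the raw boost values suggest.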

t.getBoost() is a search time boost of term t in the query q as
specified in the query text (see query syntax
<http://lucene.apache.org/java/3_0_2/queryparsersyntax.html#Boosting a Term>),
or as set by application calls to setBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#setBoost(float)>.
Notice that there is really no direct API for accessing the boost of one term
in a multi-term query. Instead, multiple terms are represented in a query as
multiple TermQuery <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/TermQuery.html>
objects, and so the boost of a term in the query is accessible by
calling the sub-query's getBoost()
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#getBoost()>.

norm(t,d) encapsulates a few (indexing time) boost and length
factors:

  - *Document boost* - set by calling *doc.setBoost()*
    <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Document.html#setBoost(float)>
    before adding the document to the index.
  - *Field boost* - set by calling *field.setBoost()*
    <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)>
    before adding the field to a document.
  - *lengthNorm(field)*
    <http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)>
    - computed when the document is added to the index in
    accordance with the number of tokens of this field in the document, so that
    shorter fields contribute more to the score. LengthNorm is computed by the
    Similarity class in effect at indexing.

When a document is added to the index, all the above factors are
multiplied. If the document has multiple fields with the same name, all
their boosts are multiplied together:

$$
\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()
$$
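
To make the effect of field length concrete, here is a minimal Python sketch of the norm formula. The excerpt above does not spell out lengthNorm; the sketch assumes the DefaultSimilarity behaviour as I understand it, 1 / sqrt(number of tokens in the field). The function names and the sample values are illustrative only.

```python
import math


def length_norm(num_terms: int) -> float:
    """Assumed default lengthNorm: 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_terms)


def norm(doc_boost: float, field_boosts: list, num_terms: int) -> float:
    """norm(t,d) = doc boost * lengthNorm * product of same-named field boosts."""
    return doc_boost * length_norm(num_terms) * math.prod(field_boosts)


short_field = norm(1.0, [1.0], num_terms=4)    # e.g. a 4-token title
long_field = norm(1.0, [1.0], num_terms=100)   # e.g. a 100-token body
boosted_short = norm(1.0, [2.0], num_terms=4)  # same title with a field boost of 2

print(short_field, long_field, boosted_short)
```

With no boosts at all, a match in the 4-token field already carries five times the norm of a match in the 100-token field, before any single-byte encoding is applied.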

However, the resulting norm value is encoded
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)>
as a single byte before being stored. At search time, the norm byte
value is read from the index directory
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/store/Directory.html>
and decoded
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)>
back to a float norm value. This encoding/decoding, while reducing
index size, comes with the price of precision loss - it is not guaranteed
that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
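
For a feel of how coarse the single-byte encoding is, here is a Python sketch that ports, from memory, the 3-mantissa-bit "small float" scheme that I believe Lucene's SmallFloat helper uses for norms. The bit layout and the names encode_norm and decode_norm are assumptions of this sketch, not a verified copy of the Lucene source, so check it against the real code before relying on the exact numbers.

```python
import struct


def float_to_raw_int_bits(f: float) -> int:
    """Bit pattern of f as a signed 32-bit int (like Java's Float.floatToRawIntBits)."""
    return struct.unpack(">i", struct.pack(">f", f))[0]


def int_bits_to_float(bits: int) -> float:
    """Inverse of float_to_raw_int_bits."""
    return struct.unpack(">f", struct.pack(">i", bits))[0]


FZERO = (63 - 15) << 3  # offset between the float exponent and the byte's exponent


def encode_norm(f: float) -> int:
    """Lossy float -> unsigned byte (0..255) under the assumed scheme."""
    bits = float_to_raw_int_bits(f)
    small = bits >> (24 - 3)
    if small <= FZERO:
        # Zero or negative maps to 0; positive underflow maps to the smallest code.
        return 0 if bits <= 0 else 1
    if small >= FZERO + 0x100:
        return 255  # overflow maps to the largest code
    return small - FZERO


def decode_norm(b: int) -> float:
    """Unsigned byte -> float under the assumed scheme."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return int_bits_to_float(bits)


for x in (1.0, 0.89, 2 ** -0.5, 0.5, 0.3):
    b = encode_norm(x)
    print(f"{x:.4f} -> byte {b:3d} -> {decode_norm(b):.4f}")
```

Under this port only four values survive between 0.5 and 1.0 (0.5, 0.625, 0.75, 0.875), so a two-token field with lengthNorm 1/sqrt(2) ≈ 0.707 is stored as 0.625. Note that this port round-trips 0.89 to 0.875 rather than the 0.75 quoted above; it has not been checked against a running Lucene, so treat that discrepancy as something to verify rather than as a correction.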

Compression of norm values to a single byte saves memory at search time,
because once a field is referenced at search time, its norms - for all
documents - are maintained in memory.

The rationale supporting such lossy compression of norm values is that
given the difficulty (and inaccuracy) of users to express their true
information need by a query, only big differences matter.

Last, note that search time is too late to modify this norm part of
scoring, e.g. by using a different Similarity
<http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html>
for search.
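
Putting the pieces together, the whole scoring function can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's implementation: the query boost is taken as 1, coord is assumed to be (matching terms) / (total query terms), the log is natural, and all frequencies, norms, and document counts are invented.

```python
import math


def tf(freq):
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc, num_docs):
    """query_terms: {term: (doc_freq, boost)}; doc: {term: (freq, norm)}."""
    ssw = sum((idf(df, num_docs) * boost) ** 2 for df, boost in query_terms.values())
    query_norm = 1.0 / math.sqrt(ssw)
    matched = [t for t in query_terms if t in doc]
    coord = len(matched) / len(query_terms)  # assumed: overlap / maxOverlap
    total = 0.0
    for t in matched:
        df, boost = query_terms[t]
        freq, norm = doc[t]
        total += tf(freq) * idf(df, num_docs) ** 2 * boost * norm
    return coord * query_norm * total


# Hypothetical two-term query against a 1000-document index.
query = {"elastic": (50, 1.0), "search": (200, 1.0)}
doc_both = {"elastic": (1, 0.5), "search": (1, 0.5)}  # matches both terms
doc_one = {"elastic": (3, 0.5)}                       # matches one term, three times

print(score(query, doc_both, 1000), score(query, doc_one, 1000))
```

In this toy setup the document matching both terms outranks the one that repeats a single term, because coord halves the score of the partial match while tf only grows with the square root of the frequency.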

Stefan_Nguyen · July 14, 2012, 1:18pm

Great answer!

Would you pls help with my situation?
I have documents with a 'sentence' field of type String. How can I boost
the score of documents with shorter 'sentence' values?


Ivan · July 18, 2012, 4:59pm

Stefan,

By the characteristics of the TF-IDF formula, shorter sentences should be
boosted. Lucene (and therefore Elasticsearch) uses norms and they are
enabled by default.

Here is a good explanation of norms and term frequencies in Lucene:

Cheers,

Ivan


timg · February 26, 2016, 9:41am

Can you cite some more examples of numbers and their equivalents when encoded to and decoded from a single byte, please? Thanks.

Ivan · February 26, 2016, 4:00pm

Wow, this is an old topic. I have never seen examples of the actual numbers
of the encoded norm, except for examples showing the lossy nature of the
encode-decode-encode process.

I am assuming you are talking about the length norm and not the overall
norm value which includes boosting. Do not use index-time boosting, use
query-time boosting and/or function scores.

You might have better luck on the Lucene list since this is more of an
internal Lucene question. Elasticsearch users tend to be more interested in
scaling and how to make Kibana look better than in pure search. The code
for encoding norms is in ClassicSimilarity:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java


There is a static lookup table, but I have not executed the code to see
what those values actually are.

I should point out that Lucene 6, and therefore Elasticsearch 5.0, will
have the BM25 similarity enabled by default. BM25 provides a more
tunable/adaptive approach to field length normalization. Before going down
the route of attempting to alter the length norm yourself, it might be
worthwhile to check out BM25.
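
To give an idea of what "tunable" means here, this is a minimal Python sketch of the textbook BM25 per-term score. It is not Lucene's code; the parameter names k1 and b and their values 1.2 and 0.75 are the commonly cited defaults, and the idf and length values are invented.

```python
def bm25_term_score(freq, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    b controls how strongly field length is penalized (0 disables it);
    k1 controls how quickly repeated occurrences stop adding score.
    """
    length_factor = 1.0 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1.0) / (freq + k1 * length_factor)


# Hypothetical term with idf 2.0, occurring once, average field length 20 tokens.
short_doc = bm25_term_score(freq=1, doc_len=5, avg_doc_len=20, idf=2.0)
long_doc = bm25_term_score(freq=1, doc_len=80, avg_doc_len=20, idf=2.0)
no_length_norm = bm25_term_score(freq=1, doc_len=80, avg_doc_len=20, idf=2.0, b=0.0)

print(short_doc, long_doc, no_length_norm)
```

Unlike the fixed 1/sqrt(length) norm, length is measured relative to the average field length, and setting b closer to 1 or 0 strengthens or removes the preference for shorter fields.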

Cheers,

Ivan
