Scoring - queryNorm differs for documents during one query

Jakub_Neubauer · April 17, 2015, 1:00pm

Hi,
The ES guide states that, when computing the score, "The same query
normalization factor is applied to every document"; see
http://www.elastic.co/guide/en/elasticsearch/guide/master/practical-scoring-function.html#query-norm
But when I try this example:

curl -s -XDELETE 'localhost:9200/ttt'

curl -s -XPUT 'http://localhost:9200/ttt/tweet/1?refresh=true' -d '{
"user" : "a b"
}'
curl -s -XPUT 'http://localhost:9200/ttt/tweet/2?refresh=true' -d '{
"user" : "b c"
}'

curl -s -XGET 'localhost:9200/ttt/_search?explain=true&format=yaml' -d '
{
  "query": {
    "match" : { "user" : "a b" }
  }
}
'

I got this result; the interesting parts are the two queryNorm values:

took: 5
timed_out: false
_shards:
  total: 5
  successful: 5
  failed: 0
hits:
  total: 2
  max_score: 0.2712221
  hits:
  - _shard: 2
    _node: "_baxQafwQ0WyAAZpyIv2ow"
    _index: "ttt"
    _type: "tweet"
    _id: "1"
    _score: 0.2712221
    _source:
      user: "a b"
    _explanation:
      value: 0.27122214
      description: "sum of:"
      details:
      - value: 0.13561107
        description: "weight(user:a in 0) [PerFieldSimilarity], result of:"
        details:
        - value: 0.13561107
          description: "score(doc=0,freq=1.0), product of:"
          details:
          - value: 0.70710677
            description: "queryWeight, product of:"
            details:
            - value: 0.30685282
              description: "idf(docFreq=1, maxDocs=1)"
            - value: 2.3043842
              description: "queryNorm"
          - value: 0.19178301
            description: "fieldWeight in 0, product of:"
            details:
            - value: 1.0
              description: "tf(freq=1.0), with freq of:"
              details:
              - value: 1.0
                description: "termFreq=1.0"
            - value: 0.30685282
              description: "idf(docFreq=1, maxDocs=1)"
            - value: 0.625
              description: "fieldNorm(doc=0)"
      - value: 0.13561107
        description: "weight(user:b in 0) [PerFieldSimilarity], result of:"
        details:
        - value: 0.13561107
          description: "score(doc=0,freq=1.0), product of:"
          details:
          - value: 0.70710677
            description: "queryWeight, product of:"
            details:
            - value: 0.30685282
              description: "idf(docFreq=1, maxDocs=1)"
            - value: 2.3043842
              description: "queryNorm"
          - value: 0.19178301
            description: "fieldWeight in 0, product of:"
            details:
            - value: 1.0
              description: "tf(freq=1.0), with freq of:"
              details:
              - value: 1.0
                description: "termFreq=1.0"
            - value: 0.30685282
              description: "idf(docFreq=1, maxDocs=1)"
            - value: 0.625
              description: "fieldNorm(doc=0)"
  - _shard: 3
    _node: "_baxQafwQ0WyAAZpyIv2ow"
    _index: "ttt"
    _type: "tweet"
    _id: "2"
    _score: 0.028130025
    _source:
      user: "b c"
    _explanation:
      value: 0.028130027
      description: "product of:"
      details:
      - value: 0.056260053
        description: "sum of:"
        details:
        - value: 0.056260053
          description: "weight(user:b in 0) [PerFieldSimilarity], result of:"
          details:
          - value: 0.056260053
            description: "score(doc=0,freq=1.0), product of:"
            details:
            - value: 0.29335263
              description: "queryWeight, product of:"
              details:
              - value: 0.30685282
                description: "idf(docFreq=1, maxDocs=1)"
              - value: 0.9560043
                description: "queryNorm"
            - value: 0.19178301
              description: "fieldWeight in 0, product of:"
              details:
              - value: 1.0
                description: "tf(freq=1.0), with freq of:"
                details:
                - value: 1.0
                  description: "termFreq=1.0"
              - value: 0.30685282
                description: "idf(docFreq=1, maxDocs=1)"
              - value: 0.625
                description: "fieldNorm(doc=0)"
      - value: 0.5
        description: "coord(1/2)"

For the document where only one term matches, the queryNorm is about 2.4
times smaller (0.9560043 vs. 2.3043842) than for the document where both
terms match. The result is too heavy a penalty for documents matching
only one term.
I see the same behaviour when using a "bool" query with two "should"
clauses.
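
The numbers in the explain output can be re-multiplied by hand to see how
much of the gap comes from queryNorm and how much from coord. A minimal
Python sketch, using only values copied from the output above (the helper
name term_score is made up for illustration):

```python
# Values copied from the explain output above.
IDF = 0.30685282      # idf(docFreq=1, maxDocs=1)
FIELD_NORM = 0.625    # fieldNorm(doc=0)
TF = 1.0              # tf(freq=1.0)


def term_score(query_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = IDF * query_norm          # "queryWeight, product of:"
    field_weight = TF * IDF * FIELD_NORM     # "fieldWeight in 0, product of:"
    return query_weight * field_weight


# Document 1 matches both terms, so two term scores are summed.
doc1 = term_score(2.3043842) + term_score(2.3043842)

# Document 2 matches one of two terms; its term score is scaled by coord(1/2).
doc2 = term_score(0.9560043) * 0.5

print(f"doc 1 score:     {doc1:.7f}")
print(f"doc 2 score:     {doc2:.7f}")
print(f"queryNorm ratio: {2.3043842 / 0.9560043:.2f}")
print(f"score ratio:     {doc1 / doc2:.2f}")
```

The fieldWeight for "b" is identical in both documents, so the whole gap
comes from three factors that compound: the second matching term, the
queryNorm difference, and coord(1/2). Together they make the one-term
document score roughly 9.6 times lower.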

Is this a bug? If not, what explains this behaviour, and where can I
find more information?

Thank you for your help

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b3a3953d-9f01-4e78-acf9-44fd95251b81%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jakub_Neubauer · April 17, 2015, 1:09pm

BTW, the reason I'm bothering with this is more complicated. The example in
the question is already simplified to its core. In my real scenario, I use
a bool query composed of several fuzzy queries. The resulting score
penalizes some documents when only one field matches, which in that case
has a more drastic impact on the user experience. For example, an exact
match in one field is scored far higher than fuzzy matches in two fields,
and so on. But the core of the problem is the difference in queryNorm
between documents, which I don't understand and cannot find explained
anywhere on the Internet.

Jakub_Neubauer · April 21, 2015, 5:46pm

Just some thoughts: since the queryNorm is calculated from the query terms'
weights (their idf values), it seems to me that it is calculated only from
those terms of the query that somehow "matched" the document in some
clause. So in our example, the terms "a" and "b" were used to calculate
queryNorm for the first document, but only the term "b" for the second
document. But this is not what one would assume from the documentation! I
would expect all query terms to be used in the calculation, to satisfy the
statement that queryNorm is fixed for all hits.
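
This guess can be checked against the numbers in the explain output. The
sketch below assumes Lucene's classic TF-IDF formulas, idf = 1 +
ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared term
weights), with all boosts equal to 1. It also assumes that each shard
uses only its own term statistics, which the output suggests (the two
hits come from shards 2 and 3, each reporting maxDocs=1) but does not
state. The function names are illustrative only.

```python
import math


def idf(doc_freq, max_docs):
    """Classic Lucene TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def query_norm(term_idfs):
    """queryNorm = 1 / sqrt(sum of squared term weights), boosts assumed 1."""
    return 1.0 / math.sqrt(sum(w * w for w in term_idfs))


seen = idf(doc_freq=1, max_docs=1)    # term occurs in the shard's only document
unseen = idf(doc_freq=0, max_docs=1)  # assumption: term absent from that shard

# Guess from the post: only the matching term "b" is used for document 2.
print(f"only 'b' used:           {query_norm([seen]):.7f}")

# Both query terms used, both present in the shard (shard 2, user "a b").
print(f"'a' and 'b' present:     {query_norm([seen, seen]):.7f}")

# Both query terms used, but "a" absent from the shard (shard 3, user "b c").
print(f"'a' absent, 'b' present: {query_norm([unseen, seen]):.7f}")
```

Under these assumptions, the "only matching terms" guess gives about 3.26,
which matches neither reported value. Using both query terms with
shard-local statistics reproduces both 2.3043842 and 0.9560043. If that
reading is right, queryNorm is constant within a shard but not across
shards. Index-wide statistics would then need a single-shard index or
search_type=dfs_query_then_fetch.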
