Query string scoped seems to calculate score incorrectly

Bruno_Miranda · March 12, 2013, 4:33pm

Can anybody explain why the second search block assigns the same score to
every document, while the first one does what I expect?

curl -X POST "http://localhost:9200/index/document/2" -d '{"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}'
curl -X POST "http://localhost:9200/index/document/3" -d '{"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}'
curl -X POST "http://localhost:9200/index/document/1" -d '{"id":1,"state_abbreviation":"CA","states_ties":["CA"]}'

PROPER SCORE CALCULATION

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
  "query": {
    "query_string": {
      "query": "CA"
    }
  }
}'

2013-03-11 16:15:40:682 [200] (1 msec)

[0.5036848, 0.44072422, 0.35615897]

BAD SCORE CALCULATION

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
  "query": {
    "query_string": {
      "query": "states_ties:CA"
    }
  }
}'

2013-03-11 16:17:24:273 [200] (1 msec)

[0.71231794, 0.71231794, 0.71231794]

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

polyfractal · March 12, 2013, 5:40pm

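So it's probably an artifact of the default search type, Query Then Fetch
(http://www.elasticsearch.org/guide/reference/api/search/search-type.html),
combined with small data + multiple shards. With Query Then Fetch, each shard
scores its documents using only its own local term statistics, which can skew
scores when there are just a handful of docs. If I specify DFS Query Then
Fetch I get these results (this is also equivalent to putting all your test
documents into a single shard):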

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty&search_type=dfs_query_then_fetch' -d '{
  "query": {
    "query_string": {
      "query": "states_ties:CA"
    }
  }
}' | grep _score

"_score" : 0.71231794, "_source" : {"id":1,"state_abbreviation":"CA","states_ties":["CA"]}
"_score" : 0.5036848, "_source" : {"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}
"_score" : 0.4451987, "_source" : {"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}
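
To make the shard effect concrete, here is a rough Python sketch of how IDF can differ when it is computed per shard versus over the whole index. The shard layout is purely hypothetical (the thread doesn't say how the three docs were actually distributed), and the IDF formula is my assumption about Lucene's default similarity at the time; it does match the idf(docFreq=3, maxDocs=3) value in the Explain output. It is not meant to reproduce the exact scores from the original question.

```python
import math


def idf(doc_freq, max_docs):
    # Assumed Lucene classic formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Hypothetical layout: (docs containing the term, total docs) for each shard.
shards = [(2, 2), (1, 1)]

# Query Then Fetch: every shard uses only its own statistics.
local_idfs = [idf(df, n) for df, n in shards]

# DFS Query Then Fetch: statistics are summed across shards first.
global_idf = idf(sum(df for df, _ in shards), sum(n for _, n in shards))

print("per-shard idf:", local_idfs)
print("global idf:   ", global_idf)
```

With this made-up layout, two docs with the same term frequency and field length would still score differently just because they live on different shards. With the global statistics they share one IDF.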

Now, the "identical score" problem is gone, but the results are not exactly
intuitive. Why does #1 return before #2, even though #2 has "CA" twice?
The Explain API helps understand this one (note: Explain doesn't use
search_type, so you have to have all your docs in a single shard):

$ curl -XGET 'http://localhost:9200/index/document/1/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'

{
  "ok" : true,
  "_index" : "index",
  "_type" : "document",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 0.71231794,
    "description" : "fieldWeight(states_ties:ca in 0), product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(termFreq(states_ties:ca)=1)"
    }, {
      "value" : 0.71231794,
      "description" : "idf(docFreq=3, maxDocs=3)"
    }, {
      "value" : 1.0,
      "description" : "fieldNorm(field=states_ties, doc=0)"
    } ]
  }
}

$ curl -XGET 'http://localhost:9200/index/document/2/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'

{
  "ok" : true,
  "_index" : "index",
  "_type" : "document",
  "_id" : "2",
  "matched" : true,
  "explanation" : {
    "value" : 0.5036848,
    "description" : "fieldWeight(states_ties:ca in 0), product of:",
    "details" : [ {
      "value" : 1.4142135,
      "description" : "tf(termFreq(states_ties:ca)=2)"
    }, {
      "value" : 0.71231794,
      "description" : "idf(docFreq=3, maxDocs=3)"
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(field=states_ties, doc=0)"
    } ]
  }
}

With Explain, you can see that the first document gets a TF weight of 1.0,
an IDF of 0.71 and a fieldNorm of 1.0.

The second doc gets a TF of 1.41: since the term appears twice, it gets
a larger weight than #1. IDF is the same, since it is computed per term
across the index and so applies to both documents equally. However, the
fieldNorm value is only 0.5, which is what ultimately drops doc #2 to second
place in the sorting. FieldNorm essentially normalizes fields based on their
length, so that longer docs are not implicitly weighted more than short docs
simply because they contain more words.
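
The arithmetic behind those two Explain outputs can be checked with a small Python sketch. The tf and idf formulas here are my assumption about Lucene's default similarity of that era (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1))); they do reproduce the tf and idf values printed above. The fieldNorm values are copied straight from the Explain output rather than computed.

```python
import math


def tf(term_freq):
    # Assumed classic Lucene term-frequency factor.
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    # Assumed classic Lucene inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of the three factors shown by Explain.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


# termFreq, docFreq, maxDocs and fieldNorm taken from the Explain output.
doc1 = field_weight(term_freq=1, doc_freq=3, max_docs=3, field_norm=1.0)
doc2 = field_weight(term_freq=2, doc_freq=3, max_docs=3, field_norm=0.5)

print("doc 1:", doc1)
print("doc 2:", doc2)
```

Both results agree with the reported scores (0.71231794 and 0.5036848) to about six decimal places. The small remaining difference comes from Lucene working in 32-bit floats.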

As a result, in this particular scenario, the length normalization is
changing up the sort order slightly. Of course, sort relevance is somewhat
subjective - perhaps a doc with a single term that matches is more relevant
than one that mentions the term five times...but in five paragraphs.
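
If you're wondering where a fieldNorm of exactly 0.5 comes from for a three-term field, here is a Python sketch of the length norm as I understand Lucene's default similarity did it back then. Treat the details as assumptions, since the thread doesn't state them: the norm is 1/sqrt(number of terms in the field), no index-time boost, and the result is squeezed into a single byte (3 mantissa bits), which loses precision.

```python
import math
import struct


def float_to_byte315(value):
    # Assumed one-byte encoding: 3 mantissa bits, 5 exponent bits.
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21
    floor = (63 - 15) << 3
    if small <= floor:
        return 0 if bits <= 0 else 1
    if small >= floor + 0x100:
        return 255
    return small - floor


def byte315_to_float(encoded):
    # Inverse of the encoding above.
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def field_norm(num_terms):
    # Length norm 1/sqrt(numTerms), after the lossy one-byte round trip.
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


idf = 1.0 + math.log(3 / 4)  # idf(docFreq=3, maxDocs=3), assumed formula

for doc_id, num_terms in [(1, 1), (2, 3), (3, 2)]:
    print(doc_id, num_terms, field_norm(num_terms))

# Doc 3 has "CA" once (tf = 1.0) in a two-term field.
doc3_score = 1.0 * idf * field_norm(2)
print("doc 3:", doc3_score)
```

Under these assumptions the three-term field of doc #2 gets 0.5 instead of the exact 0.577, and doc #3's two-term field gets 0.625, which lines up with its reported score of 0.4451987.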

The query that uses _all has similar results, but slightly different values
because the other fields (id and state_abbreviation) are also included in
the calculation.

Hope this helps!
-Zach

Bruno_Miranda · March 12, 2013, 7:03pm

Thank you for the explanation. It makes sense. It also lets me know I need
to look for an alternative way to sort this puppy.

If you have any suggestions, I'd be open to them.

Thank you

Bruno_Miranda · March 12, 2013, 10:02pm

For those curious: I was able to accomplish it like
this: https://gist.github.com/brupm/5147496
