Query string scoped to a field seems to calculate score incorrectly

Can anybody explain why the second search block assigns the same score to
every document, while the first one does what I expect it to?

curl -X POST "http://localhost:9200/index/document/2" -d '{"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}'
curl -X POST "http://localhost:9200/index/document/3" -d '{"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}'
curl -X POST "http://localhost:9200/index/document/1" -d '{"id":1,"state_abbreviation":"CA","states_ties":["CA"]}'

PROPER SCORE CALCULATION

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
"query": {
"query_string": {
"query": "CA"
}
}
}'

2013-03-11 16:15:40:682 [200] (1 msec)

[0.5036848, 0.44072422, 0.35615897]

BAD SCORE CALCULATION

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
"query": {
"query_string": {
"query": "states_ties:CA"
}
}
}'

2013-03-11 16:17:24:273 [200] (1 msec)

[0.71231794, 0.71231794, 0.71231794]

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

So it's probably an artifact of the default search type (Query Then Fetch,
http://www.elasticsearch.org/guide/reference/api/search/search-type.html)
combined with small data and multiple shards. If I specify DFS Query Then
Fetch I get these results (this is also equivalent to putting all your test
documents into a single shard):

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty&search_type=dfs_query_then_fetch' -d '{
"query": {
"query_string": {
"query": "states_ties:CA"
}
}
}' | grep _score

"_score" : 0.71231794, "_source" : {"id":1,"state_abbreviation":"CA","states_ties":["CA"]}
"_score" : 0.5036848, "_source" : {"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}
"_score" : 0.4451987, "_source" : {"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}
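To illustrate why the search type matters: with Query Then Fetch each shard scores with its own term statistics, while DFS Query Then Fetch collects them across shards first. Here is a rough Python sketch, assuming Lucene's classic IDF formula (1 + ln(maxDocs / (docFreq + 1))) and a made-up two-shard layout. The shard names and per-shard counts are hypothetical, and this is not meant to reproduce the exact identical scores above, only to show that IDF changes with the statistics it is computed from.

```python
import math

def idf(doc_freq, max_docs):
    # Classic Lucene-style IDF (an assumption about the similarity in use).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Hypothetical layout: shard name -> (docs containing the term, docs on the shard).
shards = {"shard_0": (1, 1), "shard_1": (2, 2)}

# Query Then Fetch: each shard only sees its own statistics.
local_idf = {name: idf(df, n) for name, (df, n) in shards.items()}

# DFS Query Then Fetch: statistics are summed over all shards first.
global_df = sum(df for df, _ in shards.values())
global_n = sum(n for _, n in shards.values())
global_idf = idf(global_df, global_n)

for name, value in sorted(local_idf.items()):
    print(name, round(value, 4))
print("global", round(global_idf, 4))
```

With only three documents, moving one doc to another shard changes the local IDF a lot, which is why tiny test indexes give surprising scores.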

Now, the "identical score" problem is gone, but the results are not exactly
intuitive. Why is #1 returned before #2, even though #2 has "CA" twice?
The Explain API helps understand this one (note: Explain doesn't use
search_type, so all your docs have to be in a single shard for the numbers
to match):

$ curl -XGET 'http://localhost:9200/index/document/1/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'

{
"ok" : true,
"_index" : "index",
"_type" : "document",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.71231794,
"description" : "fieldWeight(states_ties:ca in 0), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(states_ties:ca)=1)"
}, {
"value" : 0.71231794,
"description" : "idf(docFreq=3, maxDocs=3)"
}, {
"value" : 1.0,
"description" : "fieldNorm(field=states_ties, doc=0)"
} ]
}
}

$ curl -XGET 'http://localhost:9200/index/document/2/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'

{
"ok" : true,
"_index" : "index",
"_type" : "document",
"_id" : "2",
"matched" : true,
"explanation" : {
"value" : 0.5036848,
"description" : "fieldWeight(states_ties:ca in 0), product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(termFreq(states_ties:ca)=2)"
}, {
"value" : 0.71231794,
"description" : "idf(docFreq=3, maxDocs=3)"
}, {
"value" : 0.5,
"description" : "fieldNorm(field=states_ties, doc=0)"
} ]
}
}

With Explain, you can see that the first document gets a TF weight of 1.0,
an IDF of 0.71 and a fieldNorm of 1.0.

The second doc gets a TF of 1.41 because the term appears twice, so its TF
weight is larger than #1's. IDF is the same for both, since it depends on
the term and the index rather than on the individual document. However, the
fieldNorm value is only 0.5, which is what ultimately pushes doc #2 down to
second place in the sorting. FieldNorm essentially normalizes fields based
on their length, so that longer docs are not implicitly weighted more than
short docs simply because they contain more words.
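The arithmetic behind those two Explain outputs can be checked with a few lines of Python. This is only a sketch: it assumes the classic Lucene formulas (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1))), and it takes the fieldNorm values straight from the Explain output rather than deriving them. The function names are mine, not part of any API.

```python
import math

def tf(term_freq):
    # Assumed classic TF: square root of the raw term frequency.
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    # Assumed classic IDF: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of tf, idf and fieldNorm, as Explain reports.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm

# termFreq, docFreq, maxDocs and fieldNorm copied from the Explain output.
doc1 = field_weight(term_freq=1, doc_freq=3, max_docs=3, field_norm=1.0)
doc2 = field_weight(term_freq=2, doc_freq=3, max_docs=3, field_norm=0.5)

print("doc 1:", round(doc1, 4))
print("doc 2:", round(doc2, 4))
```

Doc #2's larger TF (1.41 vs 1.0) is more than cancelled by its fieldNorm of 0.5, so its product ends up below doc #1's.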

As a result, in this particular scenario, the length normalization changes
the sort order slightly. Of course, relevance is somewhat subjective:
perhaps a doc with a single matching term is more relevant than one that
mentions the term five times, but spread over five paragraphs.
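To make the length normalization a bit more concrete, here is a small Python sketch. It assumes the raw norm is 1 / sqrt(number of terms in the field), which is the classic Lucene default, and that the stored value is rounded because norms are kept at reduced precision (presumably why Explain shows 0.5 for a three-term field instead of roughly 0.577). The norm for doc #3 is not shown by Explain above, so it is inferred here from its score.

```python
import math

# Number of entries in states_ties for each doc, from the indexed JSON.
field_length = {1: 1, 2: 3, 3: 2}

# Assumed raw length norm before it is stored: 1 / sqrt(number of terms).
raw_norm = {doc: 1.0 / math.sqrt(n) for doc, n in field_length.items()}

# IDF reported by Explain, and doc 3's score from the dfs_query_then_fetch run.
idf = 0.71231794
doc3_score = 0.4451987

# Doc 3 contains "CA" once, so tf = 1 and the stored norm is score / (tf * idf).
doc3_stored_norm = doc3_score / (1.0 * idf)

for doc in sorted(raw_norm):
    print("doc", doc, "raw norm:", round(raw_norm[doc], 3))
print("doc 3 stored norm (inferred):", round(doc3_stored_norm, 3))
```

The shorter the field, the closer the norm is to 1.0, which is why the one-entry doc #1 comes out on top.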

The query that uses _all (the first one, with no field prefix) has similar
results, but slightly different values because the other fields are also
included in the calculation.

Hope this helps!
-Zach

On Tuesday, March 12, 2013 12:33:29 PM UTC-4, Bruno Miranda wrote:

> Can anybody explain why the second search block assigns the same score to
> every document? While the first one does what I expect it to?
> [...]

Thank you for the explanation. It makes sense. It also lets me know I need
to look for an alternative way to sort this puppy.

If you have any suggestions, I'd be open to them.

Thank you

On Tuesday, March 12, 2013 10:40:38 AM UTC-7, Zachary Tong wrote:

> So it's probably an artifact of the default search type (Query Then Fetch)
> + small data + multiple shards.
> [...]

For those curious: I was able to accomplish it like
this: https://gist.github.com/brupm/5147496

On Tuesday, March 12, 2013 12:03:49 PM UTC-7, Bruno Miranda wrote:

> Thank you for the explanation. It makes sense. It also lets me know I
> need to look for an alternative way to sort this puppy.
> [...]