So it's probably an artifact of the default search type (Query Then Fetch,
http://www.elasticsearch.org/guide/reference/api/search/search-type.html)
combined with small data and multiple shards. If I specify DFS Query Then
Fetch I get these results (this is also equivalent to putting all your test
documents into a single shard):
curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty&search_type=dfs_query_then_fetch' -d '{
"query": {
"query_string": {
"query": "states_ties:CA"
}
}
}' | grep _score
"_score" : 0.71231794, "_source" : {"id":1,"state_abbreviation":"CA","states_ties":["CA"]}
"_score" : 0.5036848, "_source" : {"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}
"_score" : 0.4451987, "_source" : {"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}
Now, the "identical score" problem is gone, but the results are not exactly
intuitive. Why does #1 return before #2, even though #2 has "CA" twice?
The Explain API helps understand this one (note: Explain doesn't use
search_type, so you have to have all your docs in a single shard):
$ curl -XGET 'http://localhost:9200/index/document/1/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'
{
"ok" : true,
"_index" : "index",
"_type" : "document",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.71231794,
"description" : "fieldWeight(states_ties:ca in 0), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(states_ties:ca)=1)"
}, {
"value" : 0.71231794,
"description" : "idf(docFreq=3, maxDocs=3)"
}, {
"value" : 1.0,
"description" : "fieldNorm(field=states_ties, doc=0)"
} ]
}
}
$ curl -XGET 'http://localhost:9200/index/document/2/_explain?pretty' -d '{"query":{"query_string":{"query":"states_ties:CA"}}}'
{
"ok" : true,
"_index" : "index",
"_type" : "document",
"_id" : "2",
"matched" : true,
"explanation" : {
"value" : 0.5036848,
"description" : "fieldWeight(states_ties:ca in 0), product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(termFreq(states_ties:ca)=2)"
}, {
"value" : 0.71231794,
"description" : "idf(docFreq=3, maxDocs=3)"
}, {
"value" : 0.5,
"description" : "fieldNorm(field=states_ties, doc=0)"
} ]
}
}
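If you'd rather compare the factors side by side than read the JSON by eye, something like the following pulls them out of an Explain response. This is only a sketch: the `factors` helper name is made up for illustration, and it assumes the response has the flat fieldWeight shape shown in the two responses above (a nested explanation would need a recursive walk).

```python
import json
import re

def factors(explain_response):
    """Map each factor name (tf, idf, fieldNorm) to its reported value."""
    out = {}
    for detail in explain_response["explanation"]["details"]:
        # The factor name is the leading identifier of the description,
        # e.g. "idf" in "idf(docFreq=3, maxDocs=3)".
        name = re.match(r"\w+", detail["description"]).group(0)
        out[name] = detail["value"]
    return out

# Explain response for document 2, trimmed to the fields used here.
doc2 = json.loads("""
{
  "matched": true,
  "explanation": {
    "value": 0.5036848,
    "description": "fieldWeight(states_ties:ca in 0), product of:",
    "details": [
      {"value": 1.4142135, "description": "tf(termFreq(states_ties:ca)=2)"},
      {"value": 0.71231794, "description": "idf(docFreq=3, maxDocs=3)"},
      {"value": 0.5, "description": "fieldNorm(field=states_ties, doc=0)"}
    ]
  }
}
""")

print(factors(doc2))
```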
With Explain, you can see that the first document gets a TF weight of 1.0,
an IDF of 0.71 and a fieldNorm of 1.0.
The second doc gets a TF of 1.41: the term appears twice, so it gets a
larger TF weight than #1. IDF is the same, since this applies to both
documents equally. However, the fieldNorm value is only 0.5, which is what
ultimately reduces doc #2 to second place in the sorting. FieldNorm
essentially normalizes fields based on their length, so that longer docs
are not implicitly weighted more than short docs simply because they
contain more words.
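To make the arithmetic concrete, here is a rough Python sketch that rebuilds both scores from the three factors. It assumes the classic Lucene TF-IDF similarity (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))), which is what the Explain output above appears to be using. The function names are mine, and fieldNorm is copied from the Explain output rather than recomputed, since Lucene stores it with lossy precision.

```python
import math

def tf(term_freq):
    # Classic Lucene TF: square root of the raw term frequency.
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    # Classic Lucene IDF: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of the three factors listed by Explain.
    # field_norm is passed in as reported, not derived from field length.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm

# Inputs taken from the two Explain responses.
doc1 = field_weight(term_freq=1, doc_freq=3, max_docs=3, field_norm=1.0)
doc2 = field_weight(term_freq=2, doc_freq=3, max_docs=3, field_norm=0.5)

print(round(doc1, 6), round(doc2, 6))
```

Both values agree with the Explain scores to about six decimal places, and doc #1 still comes out ahead: the halved fieldNorm costs doc #2 more than the extra occurrence of "CA" gains it.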
As a result, in this particular scenario, the length normalization is
changing up the sort order slightly. Of course, sort relevance is somewhat
subjective - perhaps a doc with a single term that matches is more relevant
than one that mentions the term five times...but in five paragraphs.
The query that uses _all has similar results, but slightly different values
because the other fields are also included in the calculation.
Hope this helps!
-Zach
On Tuesday, March 12, 2013 12:33:29 PM UTC-4, Bruno Miranda wrote:
Can anybody explain why the second search block assigns the same score to
every document, while the first one does what I expect it to?
curl -X POST "http://localhost:9200/index/document/2" -d '{"id":2,"state_abbreviation":"FL","states_ties":["NY","CA","CA"]}'
curl -X POST "http://localhost:9200/index/document/3" -d '{"id":3,"state_abbreviation":"NY","states_ties":["NY","CA"]}'
curl -X POST "http://localhost:9200/index/document/1" -d '{"id":1,"state_abbreviation":"CA","states_ties":["CA"]}'
PROPER SCORE CALCULATION
curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
"query": {
"query_string": {
"query": "CA"
}
}
}'
2013-03-11 16:15:40:682 [200] (1 msec)
[0.5036848, 0.44072422, 0.35615897]
BAD SCORE CALCULATION
curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
"query": {
"query_string": {
"query": "states_ties:CA"
}
}
}'
2013-03-11 16:17:24:273 [200] (1 msec)
[0.71231794, 0.71231794, 0.71231794]