I understand the purpose of tf*idf is to reduce the weight of more
common terms such as 'the', effectively boosting less common terms
(http://en.wikipedia.org/wiki/Tfidf).
- Can I reverse or offset the Inverse Document Frequency portion of
  the scoring equation so that MORE COMMON terms contribute more to
  the final score? (Would it ever make sense to do this?) My objective
  is to return search results that contain the most popular terms
  (terms that appear in many documents). For example, I would want a
  text query for "chilled beer" to return "chilled beer" first (it's
  an exact match), followed by the documents containing "beer", as
  it's the most popular term (appearing in 5/6 documents), and the
  remaining document, "chilled mug", last ("chilled" only appears in
  2/6 documents).
- Somewhat related, can I access docFreq, tf, idf, etc. in a _script
  object in a custom_score query (similar to accessing the _score
  parameter)?
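To make the idf question concrete, here is a minimal sketch in plain Python (not Elasticsearch code). It assumes the idf formula used by Lucene 3.x DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), and a single shard so that document frequencies are global. The name `popularity_weight` is hypothetical and only illustrates what a "reversed" weight might look like.

```python
import math

NUM_DOCS = 6
DOC_FREQ = {"beer": 5, "chilled": 2}  # document frequencies from the example data


def lucene_idf(doc_freq, num_docs):
    """idf as assumed for Lucene 3.x DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def popularity_weight(doc_freq, num_docs):
    """Hypothetical 'reversed' weight: the fraction of documents containing the term."""
    return doc_freq / num_docs


for term, df in DOC_FREQ.items():
    print(term, round(lucene_idf(df, NUM_DOCS), 3), round(popularity_weight(df, NUM_DOCS), 3))
```

Under these assumptions "chilled" gets the larger idf (about 1.693 versus 1.0 for "beer"), while the hypothetical popularity weight favours "beer" (about 0.833 versus 0.333), which is the direction I am after.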
Here is my quick example using elasticsearch 0.19.1:
Create the mapping with one string field:
curl -XPUT localhost:9200/scripts -d '{ "mappings" : { "script" : { "properties" : { "line" : { "type" : "string" } } } } }'
Populate with 6 documents (5/6 contain "beer" and 2/6 contain "chilled"):
curl -XPUT localhost:9200/scripts/script/1 -d '{"line" : "beer"}'
curl -XPUT localhost:9200/scripts/script/2 -d '{"line" : "beer pong"}'
curl -XPUT localhost:9200/scripts/script/3 -d '{"line" : "beer goggles"}'
curl -XPUT localhost:9200/scripts/script/4 -d '{"line" : "beer bong"}'
curl -XPUT localhost:9200/scripts/script/5 -d '{"line" : "chilled beer"}'
curl -XPUT localhost:9200/scripts/script/6 -d '{"line" : "chilled mug"}'
Query (with facet info):
curl -XGET 'localhost:9200/scripts/script/_search?pretty' -d '{
  "query": { "dis_max": { "queries": [
    { "text_phrase": { "line": { "query": "chilled beer", "boost": 10.0 } } },
    { "text": { "line": { "query": "chilled beer" } } }
  ] } },
  "facets": { "line": { "terms": { "field": "line", "size": 10 } } },
  "explain": false
}'
Search results. I would like them ordered as "chilled beer", "beer",
"beer ...", "chilled mug", but this is what I get:
"hits" : [ {
"_index" : "scripts",
"_type" : "script",
"_id" : "5",
"_score" : 0.38356602, "_source" : {"line" : "chilled beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "1",
"_score" : 0.025, "_source" : {"line" : "beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "6",
"_score" : 0.015625, "_source" : {"line" : "chilled mug"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "4",
"_score" : 0.0022515603, "_source" : {"line" : "beer bong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "2",
"_score" : 0.0022515603, "_source" : {"line" : "beer pong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "3",
"_score" : 0.0022515603, "_source" : {"line" : "beer goggles"}
} ]
},
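The ordering I am after can be sketched outside Elasticsearch as follows. This is only an illustration in plain Python, not a query I know how to express in Elasticsearch. The function `rank`, the `phrase_bonus` value (mirroring the boost of 10.0 above), and the naive substring phrase check are all assumptions made for the example.

```python
DOCS = {
    1: "beer", 2: "beer pong", 3: "beer goggles",
    4: "beer bong", 5: "chilled beer", 6: "chilled mug",
}


def rank(query, docs, phrase_bonus=10.0):
    """Rank docs so that terms found in MORE documents contribute MORE to the score."""
    terms = query.split()
    n = len(docs)
    # Document frequency: number of documents containing each query term.
    df = {t: sum(t in line.split() for line in docs.values()) for t in terms}

    def score(line):
        words = line.split()
        # Each matching term adds its document-frequency fraction (the reverse of idf).
        s = sum(df[t] / n for t in terms if t in words)
        # Naive exact-phrase check via substring match (assumption for this sketch).
        if query in line:
            s += phrase_bonus
        return s

    scored = [(doc_id, score(line)) for doc_id, line in docs.items()]
    scored = [(doc_id, s) for doc_id, s in scored if s > 0]
    # sorted() is stable, so ties keep their original (id) order.
    return sorted(scored, key=lambda pair: -pair[1])


ranked_ids = [doc_id for doc_id, _ in rank("chilled beer", DOCS)]
print(ranked_ids)
```

With the six example documents this puts document 5 ("chilled beer") first, the four "beer" documents next, and document 6 ("chilled mug") last.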
Is there a better way to do this than trying to reverse the idf
portion of the scoring equation? Any help would be appreciated.
Lucene Practical Scoring Function
http://elasticsearch-users.115913.n3.nabble.com/Scoring-and-boost-td3568997.html