Reverse idf so more common terms score higher than rarer terms


(Burrito) #1

I understand the purpose of tf*idf is to reduce the weight of more common
terms such as 'the', effectively boosting less common terms
(http://en.wikipedia.org/wiki/Tfidf).

  1. Can I reverse or offset the Inverse Document Frequency portion of
    the scoring equation so that MORE COMMON terms contribute more to the
    final score? (Would it ever make sense to do this?) My objective is
    to rank results by how popular their matching terms are (terms that
    appear in many documents). For example, I would want a text query for
    "chilled beer" to return "chilled beer" first (it's an exact match),
    followed by documents containing the term "beer", as it's the most
    popular term (appearing in 5/6 documents), and then the remaining
    document, "chilled mug", last ("chilled" only appears in 2/6
    documents).

  2. Somewhat related, can I access docFreq, tf, idf, etc. in a _script
    object in a custom_score query (similar to accessing the _score
    parameter)?
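
To make the arithmetic behind question 1 concrete, here is a minimal sketch. It assumes the idf formula used by Lucene 3.x's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), and compares it with a hypothetical "reversed" weight, 1 + ln(1 + docFreq), which is only one of many possible ways to favour common terms. The function names are made up for illustration, and the document counts are the ones from the example (6 documents, "beer" in 5, "chilled" in 2).

```python
import math


def classic_idf(doc_freq, num_docs):
    """Lucene 3.x DefaultSimilarity-style idf: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def reversed_idf(doc_freq):
    """Hypothetical 'popularity' weight (an assumption, not a Lucene formula):
    terms that appear in more documents get a larger weight."""
    return 1.0 + math.log(1.0 + doc_freq)


NUM_DOCS = 6
DOC_FREQ = {"beer": 5, "chilled": 2}

for term, df in DOC_FREQ.items():
    print(term, classic_idf(df, NUM_DOCS), reversed_idf(df))
```

With the classic formula "chilled" outweighs "beer"; with the reversed weight the order flips, which is the effect the question is after.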

Here is my quick example using elasticsearch 0.19.1:

Create the mapping with one string field:

curl -XPUT localhost:9200/scripts -d '{ "mappings" : { "script" : { "properties" : { "line" : { "type" : "string" } } } } }'

Populate with 6 documents: 5/6 contain "beer" and 2/6 contain "chilled":

curl -XPUT localhost:9200/scripts/script/1 -d '{"line" : "beer"}'
curl -XPUT localhost:9200/scripts/script/2 -d '{"line" : "beer pong"}'
curl -XPUT localhost:9200/scripts/script/3 -d '{"line" : "beer goggles"}'
curl -XPUT localhost:9200/scripts/script/4 -d '{"line" : "beer bong"}'
curl -XPUT localhost:9200/scripts/script/5 -d '{"line" : "chilled beer"}'
curl -XPUT localhost:9200/scripts/script/6 -d '{"line" : "chilled mug"}'

Query (with facet info):

curl -XGET 'localhost:9200/scripts/script/_search?pretty' -d '{
  "query": {"dis_max": {"queries": [
    {"text_phrase": {"line": {"query": "chilled beer", "boost": 10.0}}},
    {"text": {"line": {"query": "chilled beer"}}}
  ]}},
  "facets": {"line": {"terms": {"field": "line", "size": 10}}},
  "explain": false
}'

Search results. I would like them ordered as such: "chilled beer",
"beer", "beer ...", "chilled mug".

"hits" : [ {
"_index" : "scripts",
"_type" : "script",
"_id" : "5",
"_score" : 0.38356602, "_source" : {"line" : "chilled beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "1",
"_score" : 0.025, "_source" : {"line" : "beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "6",
"_score" : 0.015625, "_source" : {"line" : "chilled mug"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "4",
"_score" : 0.0022515603, "_source" : {"line" : "beer bong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "2",
"_score" : 0.0022515603, "_source" : {"line" : "beer pong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "3",
"_score" : 0.0022515603, "_source" : {"line" : "beer goggles"}
} ]
},

Is there a better way to do this than trying to reverse the idf
portion of the scoring equation? Any help would be appreciated.
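
As a rough check that a reversed idf would produce the ordering described above, here is a toy scorer over the six example lines. It is a sketch under loud assumptions: it is not how Elasticsearch or Lucene scores documents (no tf, length normalization, coord, query norm, or per-shard statistics), and the weight 1 + ln(1 + docFreq) and every function name are hypothetical.

```python
import math

# The six example documents, keyed by their _id.
DOCS = {
    1: "beer",
    2: "beer pong",
    3: "beer goggles",
    4: "beer bong",
    5: "chilled beer",
    6: "chilled mug",
}


def doc_freqs(docs):
    """Count how many documents contain each whitespace-separated term."""
    freqs = {}
    for text in docs.values():
        for term in set(text.split()):
            freqs[term] = freqs.get(term, 0) + 1
    return freqs


def popularity_weight(doc_freq):
    """Hypothetical reversed idf: more common terms weigh more."""
    return 1.0 + math.log(1.0 + doc_freq)


def rank(query, docs):
    """Score each document by summing the weights of the query terms it contains."""
    freqs = doc_freqs(docs)
    query_terms = set(query.split())
    scored = []
    for doc_id, text in docs.items():
        matched = query_terms & set(text.split())
        if matched:
            score = sum(popularity_weight(freqs[t]) for t in matched)
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranking = rank("chilled beer", DOCS)
for doc_id, score in ranking:
    print(doc_id, DOCS[doc_id], round(score, 3))
```

In this toy model "chilled beer" comes first because it matches both terms, the four remaining "beer" documents tie in the middle, and "chilled mug" comes last, without needing a separate phrase boost.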

Lucene Practical Scoring Function

http://elasticsearch-users.115913.n3.nabble.com/Scoring-and-boost-td3568997.html


(Jörg Prante) #2

Besides scripting, Elasticsearch has a Java package,
org.elasticsearch.index.similarity, with a SimilarityProvider that
exposes org.apache.lucene.search.Similarity, which is documented here:
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

It is configured with the index.similarity setting.

So it should be possible to add a custom scoring function to
Elasticsearch indexes by writing a plugin.

Jörg

(In reply to Burrito's post #1 above, sent Tuesday, May 8, 2012 12:51:18 AM UTC+2.)

