Reverse idf so more common terms score higher than rarer terms

Burrito · May 7, 2012, 10:51pm

I understand the purpose of tf*idf is to reduce the weight of more
common terms such as 'the', effectively boosting less common terms
(http://en.wikipedia.org/wiki/Tfidf).

Can I reverse or offset the Inverse Document Frequency portion of the
scoring equation so that MORE COMMON terms contribute more to the
final score? (Would it ever make sense to do this?) My objective is to
return search results that contain the most popular terms (terms that
appear in many documents). For example, I would want a text query for
"chilled beer" to return "chilled beer" first (it's an exact match),
followed by documents containing the term "beer", as it's the most
popular term (appearing in 5/6 documents). The remaining document,
"chilled mug", would be displayed last ("chilled" only appears in 2/6
documents).

Somewhat related, can I access docFreq, tf, idf, etc. in a _script
object in a custom_score query (similar to accessing the _score
parameter)?

Here is my quick example using elasticsearch 0.19.1:

create the mapping with one string field

curl -XPUT localhost:9200/scripts -d '{ "mappings" : { "script" :
{ "properties" : { "line" : { "type" : "string" } } } } }'

populate with 6 documents: 5/6 contain "beer" and 2/6 contain "chilled"

curl -XPUT localhost:9200/scripts/script/1 -d '{"line" : "beer"}'
curl -XPUT localhost:9200/scripts/script/2 -d '{"line" : "beer pong"}'
curl -XPUT localhost:9200/scripts/script/3 -d '{"line" : "beer goggles"}'
curl -XPUT localhost:9200/scripts/script/4 -d '{"line" : "beer bong"}'
curl -XPUT localhost:9200/scripts/script/5 -d '{"line" : "chilled beer"}'
curl -XPUT localhost:9200/scripts/script/6 -d '{"line" : "chilled mug"}'

query (with facet info)

curl -XGET 'localhost:9200/scripts/script/_search?pretty' -d '{"query":
{ "dis_max": {"queries": [
{"text_phrase": {"line": {"query": "chilled beer", "boost": 10.0}}},
{"text": {"line": {"query": "chilled beer"}}}]}},
"facets": {"line": {"terms" : {"field": "line", "size": 10}}},
"explain": false}'

search results, I would like them ordered as such: "chilled beer",
"beer", "beer ...", "chilled mug"

"hits" : [ {
"_index" : "scripts",
"_type" : "script",
"_id" : "5",
"_score" : 0.38356602, "_source" : {"line" : "chilled beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "1",
"_score" : 0.025, "_source" : {"line" : "beer"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "6",
"_score" : 0.015625, "_source" : {"line" : "chilled mug"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "4",
"_score" : 0.0022515603, "_source" : {"line" : "beer bong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "2",
"_score" : 0.0022515603, "_source" : {"line" : "beer pong"}
}, {
"_index" : "scripts",
"_type" : "script",
"_id" : "3",
"_score" : 0.0022515603, "_source" : {"line" : "beer goggles"}
} ]
},

Is there a better way to do this than trying to reverse the idf
portion of the scoring equation? Any help would be appreciated.
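
To make the term statistics in this example concrete, here is a small sketch of the idf arithmetic for the six documents. It assumes the Lucene 3.x DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)), and it assumes all six documents sit in a single shard. The function and variable names are made up for illustration. With several shards the per-shard statistics differ, so these numbers are not expected to reproduce the _score values listed above.

```python
import math


def default_idf(doc_freq: int, num_docs: int) -> float:
    """idf in the style of Lucene 3.x DefaultSimilarity (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Term statistics taken from the example: 6 documents in total,
# "beer" appears in 5 of them and "chilled" in 2.
NUM_DOCS = 6
DOC_FREQ = {"beer": 5, "chilled": 2}

for term, df in DOC_FREQ.items():
    print(f"{term}: docFreq={df}, idf={default_idf(df, NUM_DOCS):.4f}")
```

Under these assumptions "beer" gets an idf of 1.0 and "chilled" roughly 1.69, so the rarer term carries more weight, which is the opposite of the ordering I am after.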

Lucene Practical Scoring Function

http://elasticsearch-users.115913.n3.nabble.com/Scoring-and-boost-td3568997.html

jprante · May 8, 2012, 12:44pm

Besides scripting, Elasticsearch has a Java package,
org.elasticsearch.index.similarity, with a SimilarityProvider that
exposes org.apache.lucene.search.Similarity, which is documented in
the Lucene 3.6.0 API (Similarity).

Configuration is available through the index.similarity setting.

So it should be possible to add a custom scoring function to
Elasticsearch indexes by writing a plugin.
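
As a rough illustration of what such a custom scoring function could compute, here is a toy model in Python rather than an actual Lucene Similarity. Everything in it is an assumption made for the sketch: the reversed_idf formula (1 + ln(docFreq + 1)) is just one possible choice that grows with document frequency, and the scoring ignores tf, length normalization, and the phrase boost. In a real plugin the equivalent change would presumably go into the idf(int docFreq, int numDocs) method of a Similarity subclass.

```python
import math

# The six documents from the question, keyed by _id.
DOCS = {
    1: "beer",
    2: "beer pong",
    3: "beer goggles",
    4: "beer bong",
    5: "chilled beer",
    6: "chilled mug",
}


def doc_freq(term: str) -> int:
    """Number of documents that contain the term."""
    return sum(term in text.split() for text in DOCS.values())


def reversed_idf(df: int) -> float:
    """Hypothetical 'reversed' idf: grows with document frequency."""
    return 1.0 + math.log(df + 1)


def score(query: str, text: str) -> float:
    """Sum the reversed idf of every query term present in the document."""
    terms = set(text.split())
    return sum(reversed_idf(doc_freq(t)) for t in query.split() if t in terms)


def rank(query: str) -> list:
    """Document ids ordered by descending score (ties keep id order)."""
    return sorted(DOCS, key=lambda doc_id: -score(query, DOCS[doc_id]))


print(rank("chilled beer"))
```

In this toy model "chilled beer" comes first because it matches both terms, the four "beer" documents tie in the middle, and "chilled mug" comes last, which is the ordering asked for in the question.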

Jörg

