Custom relevance scoring by term frequency averages


(Chris H-3) #1

Hi,

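I want to calculate relevance scores in a different way from the default
TF-IDF scoring in ES.
In particular, I want to calculate the score simply as:

AVG( tf(term) / ttf(term) )

where tf(term) is the term's frequency in the document, ttf(term) is its
total frequency across all documents, and the average is over all matching
terms in the query.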

For example, suppose I have the following documents:

PUT /documents/document/1
{
"content": "test document test"
}

PUT /documents/document/2
{
"content": "another test document"
}

For the "test document" query below, I want the scores to be calculated as follows:

GET /documents/_search
{
"query": {"match" : {"content": "test document"}}
}

Doc 1:
AVG( tf(test)/ttf(test), tf(document)/ttf(document) ) = AVG(2/3, 1/2) = 7/12

Doc 2:
AVG( tf(test)/ttf(test), tf(document)/ttf(document) ) = AVG(1/3, 1/2) = 5/12
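
To make the arithmetic concrete, here is a minimal Python sketch of the formula, run outside Elasticsearch. It assumes plain whitespace tokenization in place of the real analyzer, and the function name avg_tf_ttf is only illustrative.

```python
from collections import Counter
from fractions import Fraction

docs = {1: "test document test", 2: "another test document"}

# Per-document term frequencies; whitespace splitting stands in for the analyzer.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Total term frequency across all documents.
ttf = Counter()
for counts in tf.values():
    ttf.update(counts)

def avg_tf_ttf(doc_id, query):
    """Average tf/ttf over the query terms that occur in the document."""
    matching = [t for t in query.split() if tf[doc_id][t] > 0]
    if not matching:
        return Fraction(0)
    return sum(Fraction(tf[doc_id][t], ttf[t]) for t in matching) / len(matching)

print(avg_tf_ttf(1, "test document"))  # 7/12
print(avg_tf_ttf(2, "test document"))  # 5/12
```

Exact fractions are used so the results can be compared directly with the hand calculation.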

Is there any way I can achieve this in Elasticsearch?
Later I may want to weight the averages by IDF or document length, but at
the moment I just want to do the above.
Any help greatly appreciated.

Thanks
Chris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5147a6e9-9a9e-4844-b153-50696a2ecc06%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

You have a couple of options. The first is to write your own similarity
class that subclasses TFIDFSimilarity or DefaultSimilarity
(http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/similarities/DefaultSimilarity.html)
and override the relevant methods. I find this option easier; however, I do
not think you will be able to access the distributed term frequencies, so it
would work only if you have a single shard or do not mind potential slight
inconsistencies between shards. The more data you have, the more the
per-shard frequencies even out.
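
As a rough illustration of the shard caveat, the Python sketch below scores the two example documents once with index-wide ttf and once with ttf counted per shard. The shard layout (one document per shard), the whitespace tokenization, and all names are assumptions made for the example; it does not call Elasticsearch.

```python
from collections import Counter
from fractions import Fraction

def total_term_freq(texts):
    """Count every term occurrence across the given texts."""
    ttf = Counter()
    for text in texts:
        ttf.update(text.split())
    return ttf

def score_with_stats(text, query, ttf):
    """Average tf/ttf over the query terms present in the text."""
    tf = Counter(text.split())
    matching = [t for t in query.split() if tf[t] > 0]
    if not matching:
        return Fraction(0)
    return sum(Fraction(tf[t], ttf[t]) for t in matching) / len(matching)

# Hypothetical layout: each document lives on its own shard.
shards = {
    "shard_a": {1: "test document test"},
    "shard_b": {2: "another test document"},
}
query = "test document"

global_ttf = total_term_freq(t for docs in shards.values() for t in docs.values())
global_scores, local_scores = {}, {}
for docs in shards.values():
    local_ttf = total_term_freq(docs.values())  # statistics seen by one shard only
    for doc_id, text in docs.items():
        global_scores[doc_id] = score_with_stats(text, query, global_ttf)
        local_scores[doc_id] = score_with_stats(text, query, local_ttf)

print(global_scores)  # doc 1 -> 7/12, doc 2 -> 5/12
print(local_scores)   # both documents score 1 with shard-local ttf
```

With so little data the shard-local scores lose the ordering entirely; on a larger index the per-shard counts would be expected to sit much closer to the global ones.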

The other option would be to use function scoring. There are some text
scoring examples on the site:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
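
A possible shape for such a request, sketched in Python so the script string can be generated per query. It assumes the ES 1.x text-scoring syntax from the page above (_index[field][term].tf() and .ttf()) and a function_score query with script_score and boost_mode "replace"; the helper name avg_tf_ttf_query and the parameter names t0, t1 are made up for the example, and the script has not been run against a cluster. As far as I know, the statistics available to scripts are per shard as well.

```python
import json

def avg_tf_ttf_query(field, terms):
    """Build a function_score body whose script averages tf/ttf over matching terms."""
    params = {"field": field}
    ratios, counts = [], []
    for i, term in enumerate(terms):
        name = "t%d" % i          # script parameter holding the term
        params[name] = term
        info = "_index[field][%s]" % name
        # Only terms present in the document (tf > 0) contribute to the average.
        ratios.append("(%s.tf() > 0 ? 1.0 * %s.tf() / %s.ttf() : 0.0)" % (info, info, info))
        counts.append("(%s.tf() > 0 ? 1 : 0)" % info)
    script = "(%s) / (%s)" % (" + ".join(ratios), " + ".join(counts))
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: " ".join(terms)}},
                "script_score": {"script": script, "params": params},
                "boost_mode": "replace",  # use the script result as the score
            }
        }
    }

body = avg_tf_ttf_query("content", ["test", "document"])
print(json.dumps(body, indent=2))
```

The terms passed in should be the analyzed forms (for example lowercased), since the script looks them up in the index as given.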

If you want to provide a native Java solution, Britta (who wrote much of
the function scoring code) contributed examples to Igor's native script
example repo: https://github.com/imotov/elasticsearch-native-script-example

Of interest:
https://github.com/imotov/elasticsearch-native-script-example/blob/master/src/main/java/org/elasticsearch/examples/nativescript/script/TFIDFScoreScript.java
https://github.com/imotov/elasticsearch-native-script-example/blob/master/src/main/java/org/elasticsearch/examples/nativescript/script/CosineSimilarityScoreScript.java

Cheers,

Ivan

On Sat, May 24, 2014 at 7:22 AM, Chris H c.harper80@gmail.com wrote:



