Score based on Term Frequency alone

I want to disable IDF, and maybe TF too, so that the score is based only on how many terms from the given query are present in a document. I've looked into solutions that use custom scripts or restructure the query, but they all involve splitting the query into individual terms. The problem with this approach is that if a wrapper service receives the query as a single string containing multiple tokens, it has to split that string into tokens before building the ES search request. The only reliable way to do that is to call the analyze endpoint first, which adds an extra round trip and slows things down.
(Examples: How to complete disable TF-IDF?, https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html)
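For anyone unfamiliar with the workaround described above, here is a minimal sketch of what it looks like: one constant_score clause per token inside a bool query, so each matching token adds a fixed amount to the score. The field name "title" and the helper presence_query are hypothetical, and the whitespace split is only a stand-in for the real analyzer, which is exactly the weak point mentioned above.

```python
import json


def presence_query(field, tokens):
    """Build a bool query where each matching token contributes its boost.

    `field` and this helper are hypothetical names used for illustration.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "constant_score": {
                            "filter": {"term": {field: token}},
                            "boost": 1.0,
                        }
                    }
                    for token in tokens
                ]
            }
        }
    }


# Assumption: a naive lowercase + whitespace split replaces the analyzer.
# A real setup would need the tokens the field's analyzer actually produces.
tokens = "Quick brown fox".lower().split()
body = presence_query("title", tokens)

# The resulting JSON is what would be sent as the search request body.
print(json.dumps(body, indent=2))
```

The splitting step is what forces the extra analyze call: the term clauses only match if the tokens are identical to what the analyzer wrote into the index.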

I think ES is awesome, but it would be cool if there were more similarity modules covering simple use cases, such as a plain count of term presence in a document. Why not ship a whole set of similarity modules: summing one-hot vectors, cosine, etc.?

Any advice on how I can achieve my goal? I am using Elasticsearch 5.2.2.

Thanks!

EDIT: I got this working by writing a plugin. There are examples online, but they are outdated. I will formalise my solution, post it to Git, and update this thread in due time.

OK, if anyone wants to disable both IDF and TF, so that the score is based only on the presence of a term and the boost value on the field, in Elasticsearch v5+, see the following plugin:

If you don't want to disable TF but don't know how to write a plugin, the code in the repo above should help; adding TF back should be simple.

Also note that someone has implemented this in the latest Elasticsearch code, see:


You just need to set "similarity": "boolean" on the field properties in your mapping. This is available in Elasticsearch 5.4.0+.
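As a rough sketch of what that looks like, the snippet below builds an index-creation body that sets the boolean similarity on one text field, plus a plain match query against it. The index name "my_index", the mapping type "doc" and the field "title" are hypothetical placeholders; the snippet only prints the JSON bodies and does not talk to a cluster.

```python
import json

# Hypothetical names, chosen only for illustration.
INDEX = "my_index"
DOC_TYPE = "doc"   # 5.x mappings are still keyed by a mapping type
FIELD = "title"

# Index-creation body: the field uses the built-in "boolean" similarity,
# so TF and IDF no longer influence the score for this field.
mapping_body = {
    "mappings": {
        DOC_TYPE: {
            "properties": {
                FIELD: {"type": "text", "similarity": "boolean"}
            }
        }
    }
}

# A normal match query: the whole query string goes in unsplit and the
# field's analyzer tokenizes it, so no separate analyze call is needed.
search_body = {
    "query": {
        "match": {FIELD: {"query": "quick brown fox", "boost": 1.0}}
    }
}

print("PUT /%s" % INDEX)
print(json.dumps(mapping_body, indent=2))
print("POST /%s/_search" % INDEX)
print(json.dumps(search_body, indent=2))
```

With this similarity each matching term should score its query boost, so for a match query the document score should come out as the number of query terms present multiplied by the boost. That is worth checking against your own data with "explain": true.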
