Disabling Inverse Document Frequency (IDF) in Elasticsearch relevance scoring

I am getting irrelevant results when running a simple match query on my documents.

My documents contain a lot of repeated words, but every word is important.

Documents are:
Doc_1) Nestle
Doc_2) Nestle Kitkat
Doc_3) nestle chocolate
Doc_4) nestle candy
Doc_5) nestle yoghurt
Doc_6) nestle fruit
-- plus hundreds of similar documents.

Now, when I run a simple match query for "nestle", Doc_1 is not scored first. The term is repeated across so many documents that, because of IDF, it is not treated as relevant.

I have tried disabling norms with norms: {"enabled": false} and setting index_options: "docs"
in the field mapping, but I am still not getting relevant results.

{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        {
          "match": {
            "choclate.name": {
              "query": "Nestle",
              "operator": "and"
            }
          }
        },
        {
          "match": {
            "choclate.whitespace": {
              "query": "Nestle",
              "operator": "and"
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 1
}

Do I need to use a custom score function?
If I use a custom score function, I won't get the other relevance scoring features, such as field length normalization.

Unfortunately, this is the kind of requirement that requires plugging in a custom similarity. This issue might do what you want: https://github.com/elastic/elasticsearch/issues/6731. It proposes adding a new similarity that does not take term frequency or document frequency into account, just the number of matching clauses, plus document length if norms are enabled. You can upvote it if you think it would address your requirements.
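To make the idea in that issue concrete, here is a minimal Python sketch of a score that ignores term frequency and document frequency and only counts matching query clauses, optionally scaled by document length. This is not Lucene or Elasticsearch code: the function name `matching_clause_score`, the `use_norms` flag, the whitespace tokenizer, and the 1/sqrt(length) norm are all assumptions made for illustration.

```python
import math


def matching_clause_score(query_terms, doc_tokens, use_norms=False):
    """Count how many distinct query terms occur in the document.

    Term frequency and document frequency are deliberately ignored.
    With use_norms=True the count is divided by sqrt(document length),
    an assumed stand-in for a length norm, so shorter documents win ties.
    """
    tokens = set(doc_tokens)
    matches = sum(1 for term in set(query_terms) if term in tokens)
    if not use_norms or not doc_tokens:
        return float(matches)
    return matches / math.sqrt(len(doc_tokens))


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on whitespace.
    return text.lower().split()


docs = ["Nestle", "Nestle Kitkat", "nestle chocolate", "nestle candy"]
query = ["nestle"]

# Rank documents by the length-normalized clause count, best first.
ranked = sorted(
    docs,
    key=lambda d: matching_clause_score(query, tokenize(d), use_norms=True),
    reverse=True,
)
```

Without the length norm, every document above gets the same score for a one-term query, so the order among them is arbitrary. In this sketch it is the length norm that separates the one-word document from the longer ones.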


Thanks for your reply, @jpountz.
Instead of IDF, I would like to use DF only.

Is there any way to do that?

@jpountz It looks like both of these requirements are the same.

Then you would indeed need a custom similarity. Similarities in Lucene are not designed to be extended, but it should be fairly easy to copy-paste an existing implementation (typically BM25) and adapt it to your needs.

One solution could be to set k1 to 0 in BM25. Looking at the formula, I believe that leaves just the BM25 IDF as the score. More here.
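A quick way to check that claim is to write out the textbook BM25 per-term formula and plug in k1 = 0. The Python sketch below is only an illustration of the formula, not Lucene's implementation; the function names and the example numbers are assumptions.

```python
import math


def bm25_idf(doc_count, doc_freq):
    # BM25 IDF in the form log(1 + (N - n + 0.5) / (n + 0.5)).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document."""
    if tf == 0:
        return 0.0
    # Length normalization: longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return bm25_idf(doc_count, doc_freq) * tf_part


# With k1 = 0 the tf part becomes tf / tf = 1, so term frequency and
# document length drop out and only the IDF remains.
idf_only = bm25_term_score(tf=3, doc_len=10, avg_doc_len=2,
                           doc_count=100, doc_freq=100, k1=0)
```

Note that with k1 = 0 the score still contains IDF, and the length normalization disappears along with term frequency, so this setting removes TF rather than IDF. For a query with a single term, IDF is the same for every matching document within one index, so all matches would tie.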

I have a plugin that ignores tf-idf.

You can use it as an example and update the tf and/or idf methods to return the term frequency or document frequency.
