How to completely disable Inverse document frequency?

I need to build a full-text search for names. I would like to disable inverse document frequency (IDF) because it is irrelevant to my use case. Is it possible not only to ignore IDF in scoring but also to turn it off entirely, so that no CPU is wasted computing it?

Best regards

Why is it not relevant?

Because IDF exists to increase the relevance of rare terms within a shard, i.e. terms that are rare in a specific shard get more weight than terms that are frequent in that shard. IDF is good if your data is well balanced among shards and your index contains a lot of uninformative words such as "the", "is", "very" and so on.

In my situation, I have an index of user names.
I.e. data like: Will Smith, Katy Perry, Brad Pitt...
Logically, I won't have any unnecessary words in my index. If IDF is used, then Will Smith on shard 1 and Will Smith on shard 2 may get different scores, and the score will reflect not "how closely your query matches the data" but "how rare your data is on a specific shard".
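To illustrate the shard effect described above, here is a minimal Python sketch. It assumes the IDF formula used by Lucene's BM25 similarity, ln(1 + (N - n + 0.5) / (n + 0.5)), and the per-shard numbers are made up for the example; the function name bm25_idf is hypothetical, not an Elasticsearch API.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term of Lucene's BM25 similarity: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical per-shard statistics for the term "will" (made-up numbers).
shard_1 = {"doc_count": 1000, "doc_freq": 10}
shard_2 = {"doc_count": 1000, "doc_freq": 50}

idf_1 = bm25_idf(**shard_1)
idf_2 = bm25_idf(**shard_2)

# The same term gets a different weight on each shard,
# so two identical "Will Smith" documents can score differently.
print(f"shard 1 idf: {idf_1:.3f}")
print(f"shard 2 idf: {idf_2:.3f}")
```

The term is rarer on shard 1, so it gets the larger IDF there, even though the documents themselves are identical.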

IDF is good for searching arbitrary free text, but it is bad for name search.

Correct me if I am wrong.

Why not just use filters then?

Also, Elasticsearch 6.x uses BM25 as its default similarity, and it is much better at dealing with relevance than TF/IDF.

Filters do not provide a score, and I need a score. The more closely the query matches the indexed data, the more relevant the result is to me.
If I have names like Will Smith, Will Smeeth, Bill Smith and I search for "Will Smith", I need every Will Smith to be at the top and the other names below.

Also, I need identical names to have the same score. Say I have several Will Smiths in shard 1 and several Will Smiths in shard 2: I need them all to have the same score. Both BM25 and TF/IDF use IDF, so they do not provide suitable scores for name search.

When I search for "Will Smith", I first need to show all users named "Will Smith", then all users named "Bill Smith", then all users named "Robert Smith", and so on. In my situation I don't care how many users have the name "Will Smith" or "Robert Smith". All I care about is how similar the data is to "Will Smith".

Right now I am ignoring IDF in my similarity and it works well, but I would also like to switch off the IDF computation itself, because it is unnecessary work that I don't need. Here is what I use right now:

"similarity" : {
  "default": {
    "type": "scripted",
    "weight_script": {
      "source": "return query.boost;"
    },
    "script": {
      "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
    }
  }
}
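To make clear what this similarity computes, here is a small Python sketch that mirrors the two scripts above. The function name name_score is hypothetical; freq, length and boost stand in for doc.freq, doc.length and query.boost. It only reproduces the arithmetic and says nothing about whether Elasticsearch still collects term statistics internally.

```python
import math

def name_score(freq: int, length: int, boost: float = 1.0) -> float:
    """Mirror of the scripted similarity: weight * sqrt(freq) / sqrt(length)."""
    weight = boost                  # weight_script: return query.boost
    tf = math.sqrt(freq)            # double tf = Math.sqrt(doc.freq)
    norm = 1 / math.sqrt(length)    # double norm = 1/Math.sqrt(doc.length)
    return weight * tf * norm

# "will" occurs once in the two-term field "Will Smith".
# No shard-level statistic is an input, so the result is the same
# wherever the document happens to live.
score_on_shard_1 = name_score(freq=1, length=2)
score_on_shard_2 = name_score(freq=1, length=2)
print(score_on_shard_1 == score_on_shard_2)
```

Because the only inputs are the term frequency, the field length and the query boost, two identical names always get the same per-term score under this formula.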
