How to completely disable Inverse document frequency?

porunov · August 21, 2018, 12:48pm

I need to implement full-text search for names. I would like to disable inverse document frequency (IDF) because it is irrelevant in my use case. Is it possible not only to ignore IDF in scoring but also to turn it off entirely, so that no CPU is wasted on computing it?

Best regards

warkolm · August 22, 2018, 1:52am

Why is it not relevant?

porunov · August 22, 2018, 7:16am

Because IDF exists to boost rare terms within a shard: terms that are rare in a specific shard get a higher weight than terms that are frequent in that shard. IDF is good if your data is well balanced among shards and your index contains a lot of uninformative words such as "the", "is", "very" and so on.

In my situation, I have an index of user names.
I.e. data like: Will Smith, Katy Perry, Brad Pitt...
Logically, I won't have any uninformative words in my index. If IDF is used, then Will Smith on shard 1 and Will Smith on shard 2 may get different scores, and the score will reflect not "how closely your query matches the current data" but "how rare your data is on a specific shard".
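To illustrate the per-shard effect described above, here is a minimal Python sketch. It assumes the IDF formula used by Lucene's BM25 similarity, ln(1 + (N - n + 0.5) / (n + 0.5)); the shard sizes and document frequencies are made-up numbers, not measurements from a real cluster.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term as used by Lucene's BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shard statistics:
# (documents in the shard, documents in the shard containing "smith")
shards = {"shard_1": (1000, 5), "shard_2": (1000, 50)}

idf_by_shard = {name: bm25_idf(n_docs, df) for name, (n_docs, df) in shards.items()}

for name, idf in idf_by_shard.items():
    print(f"{name}: idf('smith') = {idf:.3f}")
```

With these assumed numbers, the same term gets a higher IDF on the shard where it is rarer, so two identical "Will Smith" documents can end up with different scores depending on which shard holds them.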

IDF is good for searching arbitrary text, but it is bad for name search.

Correct me if I am wrong.

warkolm · August 22, 2018, 7:22am

Why not just use filters then?

Also, Elasticsearch 6.x uses BM25 as its default similarity, which is much better at dealing with relevance than classic TF/IDF.

porunov · August 22, 2018, 8:01am

Filters do not provide a score, and I need one. The more closely the query matches the indexed data, the more relevant the result is to me.
If I have names like Will Smith, Will Smeeth and Bill Smith and I search for "Will Smith", I need every Will Smith at the top and the other names below.

Also, I need identical names to have the same score. Say I have several Will Smiths in shard 1 and several in shard 2: I need them all to have the same score. Both BM25 and TF/IDF use IDF, so neither gives me the scores I need for name search.

When I search for "Will Smith", I need to show first all users named "Will Smith", then all users named "Bill Smith", then all users named "Robert Smith" and so on. In my situation I don't care how many users have the name "Will Smith" or "Robert Smith". All I care about is how similar the data is to "Will Smith".

Right now I am ignoring IDF in my similarity and it works well, but I would also like to turn off the IDF computation itself, because it is unnecessary work for my use case. Here is what I use right now:

"similarity" : {
  "default": {
    "type": "scripted",
    "weight_script": {
      "source": "return query.boost;"
    },
    "script": {
      "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
    }
  }
}
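
For readers who want to check what this similarity computes, here is a rough Python mirror of the per-term formula in the script above. The function name and the sample values are illustrative only; in Elasticsearch the values of doc.freq, doc.length and query.boost are supplied by the scripted similarity at search time.

```python
import math

def score(freq: int, length: int, boost: float = 1.0) -> float:
    """Per-term score mirroring the script: boost * sqrt(freq) / sqrt(length)."""
    tf = math.sqrt(freq)            # term frequency component
    norm = 1 / math.sqrt(length)    # field-length normalisation
    return boost * tf * norm        # no IDF factor anywhere

# "will" occurs once in a two-term field such as "Will Smith".
# Shard-level statistics never enter the formula, so the score is
# the same wherever the document is stored.
shard_1_score = score(freq=1, length=2)
shard_2_score = score(freq=1, length=2)

# A longer field containing the same term scores lower.
longer_field_score = score(freq=1, length=3)
```

Because the only inputs are the term frequency, the field length and the query boost, identical names always receive identical per-term scores, which is the behaviour asked for above.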

