Question regarding TF/IDF implementation

Yaron_Golan · March 22, 2021, 3:18am

We have been using an older Elastic version (1.4) for a while and have recently upgraded but wish to continue using the TF/IDF scoring algorithm.

The Similarity module page of the Elasticsearch Reference [7.11] has an example of how to re-implement TF/IDF.

It is:
"source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
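For context, that script is registered as a `scripted` similarity in the index settings and then assigned to a field in the mappings. Below is a rough sketch of the request body, built as a Python dict so it can be serialized and sent with any HTTP client. The similarity name `scripted_tfidf` and the field name `body` are placeholders I picked for illustration, not names from our actual index.

```python
import json

# The script source, copied from the reference example quoted above.
TFIDF_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; "
    "double norm = 1/Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)

# Index-creation body: define the scripted similarity, then attach it to a field.
# "scripted_tfidf" and "body" are hypothetical names used only for this sketch.
index_body = {
    "settings": {
        "similarity": {
            "scripted_tfidf": {
                "type": "scripted",
                "script": {"source": TFIDF_SOURCE},
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "similarity": "scripted_tfidf"}
        }
    },
}

# JSON payload for a PUT /<index-name> request.
payload = json.dumps(index_body, indent=2)
```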

Why is field.docCount being used instead of simply the number of indexed documents?

vincenbr · March 22, 2021, 11:25am

Hi Yaron, in Elasticsearch (even in version 1.4) similarity is per field. So the IDF part is based on the field's scope, meaning the count of documents that contain this field.
In version 7.x, I think (I did not test it) you can still use TF/IDF, even though it is deprecated, by setting the index-level similarity to "classic", as described on the similarity page of the Elasticsearch Reference [7.11].
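To make the difference concrete, here is a small sketch that evaluates the same formula in plain Python, once with a per-field document count and once with a whole-index count. All the numbers are made up for illustration (an index of 1000 documents where only 100 have a value for the field), and `scripted_tfidf` is just a helper name for this example, not an Elasticsearch API.

```python
import math

def scripted_tfidf(freq, doc_length, doc_count, doc_freq, boost=1.0):
    """Mirror of the Painless script: boost * tf * idf * norm."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1.0 / math.sqrt(doc_length)
    return boost * tf * idf * norm

# Hypothetical statistics: the term occurs twice in a 16-token field value
# and appears in 10 documents overall.
freq, doc_length, doc_freq = 2, 16, 10

# field.docCount: only documents that actually have this field (assumed 100).
per_field = scripted_tfidf(freq, doc_length, doc_count=100, doc_freq=doc_freq)

# Total indexed documents (assumed 1000), including those without the field.
whole_index = scripted_tfidf(freq, doc_length, doc_count=1000, doc_freq=doc_freq)

print(f"per-field IDF scope:   {per_field:.4f}")
print(f"whole-index IDF scope: {whole_index:.4f}")
```

With the whole-index count, documents that never had the field still inflate the IDF, so the term looks rarer than it really is within that field.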

