Question regarding TF/IDF implementation

We have been using an older Elasticsearch version (1.4) for a while and have recently upgraded, but we wish to continue using the TF/IDF scoring algorithm.

The Similarity module page of the Elasticsearch Reference [7.11] has an example of how to re-implement TF/IDF.

It is:
"source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"

Why is field.docCount being used instead of simply the number of indexed documents?

Hi Yaron, in Elasticsearch (even in version 1.4) similarity is per-field. So the idf part is computed within the field's "scope": field.docCount is the number of documents that contain this field, not the total number of indexed documents.
In version 7.x, I think (I did not test this) you can still use TF/IDF, even though it is deprecated, by setting the index-level similarity to "classic", as described on the similarity page of the Elasticsearch Reference [7.11].
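
To make the per-field point concrete, here is a minimal sketch in plain Java (not Painless, and not the Elasticsearch API) that mirrors the arithmetic of the script quoted in the question. The class name, method names and all numbers are hypothetical and only for illustration; the two idf calls differ only in which document count is passed in.

```java
public class ScriptedTfIdfSketch {

    // Same expression as the idf term in the quoted script.
    // docCount is whichever document count the caller decides to pass.
    public static double idf(long docCount, long termDocFreq) {
        return Math.log((docCount + 1.0) / (termDocFreq + 1.0)) + 1.0;
    }

    // Same expression as the full script: boost * tf * idf * norm.
    public static double score(double boost, int freq, int docLength,
                               long docCount, long termDocFreq) {
        double tf = Math.sqrt(freq);
        double norm = 1 / Math.sqrt(docLength);
        return boost * tf * idf(docCount, termDocFreq) * norm;
    }

    public static void main(String[] args) {
        // Hypothetical index: 1000 documents in total, 100 of them have a
        // value for the field, and 10 of those contain the term.
        long totalDocs = 1000;
        long fieldDocCount = 100;
        long termDocFreq = 10;

        System.out.println("idf using field.docCount: " + idf(fieldDocCount, termDocFreq));
        System.out.println("idf using total docs:     " + idf(totalDocs, termDocFreq));
        System.out.println("score using field.docCount: "
                + score(1.0, 4, 16, fieldDocCount, termDocFreq));
    }
}
```

With these made-up numbers the idf based on the total document count comes out larger than the one based on field.docCount, so a term in a sparsely populated field would look rarer than it really is within that field. That is presumably why the script uses the per-field count.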
