Custom TF-IDF implementation

Karel_Haerens1 · March 2, 2023, 3:32pm

I'm trying to implement a custom TF-IDF-like algorithm with scripted similarity. My current approach:

Term frequency is determined in a custom way: the values are precalculated and stored in the records as lists. For example:

{
  "my_text": "the apple falls",
  "my_text_counts": [10, 5, 7]
}

In this document, the array of numbers holds the term frequency for each word in the text field: the first number belongs to the first word, the second to the second word, and so on.

I now want the similarity score to be the sum of the inverses of these values, counting only the words that also appear in the query.

For example, the query "the apple" would yield
(1 * 1/10 + 1 * 1/5 + 0 * 1/7)
and "the falls" would yield
(1 * 1/10 + 0 * 1/5 + 1 * 1/7)
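
To pin down the scoring I have in mind, here is a minimal Python sketch of it, outside Elasticsearch. It is not a scripted similarity or Painless script. The function name inverse_count_score is made up for illustration, and it assumes plain whitespace tokenization, lowercase matching, and that my_text_counts lines up position by position with the words in my_text.

```python
def inverse_count_score(doc, query):
    """Sum 1/count over the document words that also occur in the query.

    Assumptions (not from any Elasticsearch API): whitespace tokenization,
    case-insensitive matching, and counts aligned by position with the words.
    """
    tokens = doc["my_text"].lower().split()
    counts = doc["my_text_counts"]
    if len(tokens) != len(counts):
        raise ValueError("my_text and my_text_counts must have the same length")

    query_terms = set(query.lower().split())
    # A word contributes 1/count when it is in the query, and 0 otherwise.
    return sum(1.0 / count
               for token, count in zip(tokens, counts)
               if token in query_terms)


doc = {"my_text": "the apple falls", "my_text_counts": [10, 5, 7]}

score_the_apple = inverse_count_score(doc, "the apple")  # 1/10 + 1/5
score_the_falls = inverse_count_score(doc, "the falls")  # 1/10 + 1/7
print(score_the_apple, score_the_falls)
```

The two calls correspond to the two sums written out above, roughly 0.3 and 0.243. If a word occurs more than once in my_text, this sketch adds its inverse count once per occurrence.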

After a lot of searching through the documentation, I'm starting to think this is impossible with the current scripted similarity context.

Any tips or advice would be welcome.
