I have a use case where I compare input titles against a corpus of titles to determine whether I have seen a given title before. The default match scoring on the title field works great for ranking, but I also want to extract a comparison of the input title and the suggested title.
My thinking is that I can add a script_fields entry to compute a TF difference as follows, but I'm in the weeds on this one and not sure how to start. Can I script this as a field?
Pseudo-code:
get terms from corpus title "corpus_terms"
get terms from input title "input_terms"
let all_terms = union(corpus_terms, input_terms)
let total = 0
let total_pcts = 0
for term in all_terms:
    let weight = 0.0
    if term in input_terms:
        if term in corpus_terms:
            weight = 1.0
        else:
            weight = 0.0
    else:
        # Slight penalty for terms in corpus title not appearing in input_terms
        weight = -0.001
    let pct_of_titles_with_term = index[term].doc_count() / index.doc_count()
    total_pcts += pct_of_titles_with_term
    total += (pct_of_titles_with_term * weight)
let difference_from_0_to_1 = total / total_pcts
return difference_from_0_to_1
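
To make the intent concrete, here is the same calculation as a plain Python sketch that runs outside Elasticsearch. It is only an illustration of the arithmetic, not a script_fields implementation: the function name title_difference, the doc_freq dictionary, and total_docs are hypothetical stand-ins for index[term].doc_count() and index.doc_count(), and the terms are assumed to be already tokenized.

```python
def title_difference(input_terms, corpus_terms, doc_freq, total_docs):
    """Weighted overlap between two titles, following the pseudo-code above.

    doc_freq (term -> number of titles containing it) and total_docs are
    hypothetical stand-ins for the index statistics.
    """
    input_terms = set(input_terms)
    corpus_terms = set(corpus_terms)

    total = 0.0
    total_pcts = 0.0
    for term in input_terms | corpus_terms:
        if term in input_terms:
            # Shared terms count fully; input-only terms contribute nothing.
            weight = 1.0 if term in corpus_terms else 0.0
        else:
            # Slight penalty for corpus-title terms missing from the input.
            weight = -0.001

        pct_of_titles_with_term = doc_freq.get(term, 0) / total_docs
        total_pcts += pct_of_titles_with_term
        total += pct_of_titles_with_term * weight

    # Guard against division by zero when no term has any document frequency.
    return total / total_pcts if total_pcts else 0.0


# Made-up statistics, for illustration only.
doc_freq = {"quick": 10, "brown": 20, "fox": 5, "dog": 15}
score = title_difference(
    ["quick", "brown", "fox"], ["quick", "brown", "dog"], doc_freq, 100
)
print(score)
```

Note that, as written, terms are weighted by how many titles contain them, so common terms influence the result more than rare ones. The zero-division guard is an addition that the pseudo-code does not have.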