Compute term frequency diff

aczapszys · August 1, 2019, 5:18pm

I have a use case where I compare input titles against a corpus of titles to determine whether I have seen a title before. The default match scoring on the title field works great for ranking, but I also want to extract a comparison of the input title and the suggested title.

My thinking is that I can add a script_fields entry to compute a TF difference as follows, but I'm in the weeds on this one and not sure where to start. Can I script this as a field?

Pseudo-code:
get terms from corpus title "corpus_terms"
get terms from input title "input_terms"

let all_terms = union(corpus_terms , input_terms)
let total = 0
let total_pcts = 0
for term in all_terms:
    let weight = 0.0
    if term in input_terms:
        if term in corpus_terms:
            weight = 1.0
        else:
            weight = 0.0
    else:
        # Slight penalty for terms in corpus title not appearing in input_terms
        weight = -0.001

    let pct_of_titles_with_term = index[term].doc_count() / index.doc_count()
    total_pcts += pct_of_titles_with_term
    total += (pct_of_titles_with_term * weight)

let difference_from_0_to_1 = total / total_pcts
return difference_from_0_to_1
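
For reference, the pseudo-code above can be sketched as a plain Python function that runs client-side, outside the search engine. This is only an illustration of the arithmetic, not a script_fields implementation. It assumes the terms of both titles have already been analyzed into sets and that per-term document counts have been fetched separately. The names `title_overlap_score`, `doc_freq`, `total_docs`, and `missing_penalty` are hypothetical and do not come from any library.

```python
from typing import Mapping, Set


def title_overlap_score(
    input_terms: Set[str],
    corpus_terms: Set[str],
    doc_freq: Mapping[str, int],   # assumed: term -> number of titles containing it
    total_docs: int,               # assumed: total number of titles in the index
    missing_penalty: float = -0.001,
) -> float:
    """Weighted overlap of two titles, following the pseudo-code above."""
    if total_docs <= 0:
        return 0.0

    total = 0.0
    total_pcts = 0.0
    for term in input_terms | corpus_terms:
        if term in input_terms:
            # Shared terms count fully; input-only terms contribute nothing.
            weight = 1.0 if term in corpus_terms else 0.0
        else:
            # Slight penalty for corpus terms missing from the input title.
            weight = missing_penalty

        # Terms unknown to doc_freq are treated as appearing in no titles.
        pct_of_titles_with_term = doc_freq.get(term, 0) / total_docs
        total_pcts += pct_of_titles_with_term
        total += pct_of_titles_with_term * weight

    # Guard against division by zero when no term has a document count.
    return total / total_pcts if total_pcts else 0.0


if __name__ == "__main__":
    counts = {"red": 50, "fox": 10, "dog": 40}
    print(title_overlap_score({"red", "fox"}, {"red", "dog"}, counts, 100))
```

Two things stand out in this sketch. The weighting uses the share of titles containing each term, so common terms dominate the score, and the penalty branch means the result can dip slightly below 0 when the titles share no terms.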
