I am trying to write a custom scoring script that computes cosine similarity. Here's what I have so far:
field_name = 'visual_words'
dot_product = 0
norm_l = 0
norm_r = 0
// Last option: iterate from 0 to dictionary size
// Dictionary size can be passed as argument to query
// remember which terms we have seen
def doc_terms_seen = [:]
index_field = _index[field_name]
// query_words has the following shape:
// [[word1, word1_weight], [word2, word2_weight],...]
for (pair in query_words) {
    word = pair[0]
    weight_l = pair[1]
    norm_l = norm_l + (weight_l * weight_l)
    // Look up the corresponding word token in the indexed field
    word_str = word.toString()
    termInfo = index_field.get(word_str, _PAYLOADS)
    for (pos in termInfo) {
        // this iterates only once...
        weight_r = pos.payloadAsInt(0)
        norm_r = norm_r + (weight_r * weight_r)
        dot_product = dot_product + (weight_l * weight_r)
        doc_terms_seen[word_str] = true
    }
}
// Maybe disable security?
// https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-security.html
println _index.termVectors()
// outputs: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields@51d467a8
println _index.termVectors().terms(field_name)
/* Outputs the following error:
Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface
Full stack trace: https://gist.github.com/f057ef61d59c39eaf375bfb7be68bfe9
*/
// How to get doc_terms??
/*
for (term in doc_terms) {
    if (doc_terms_seen.get(term.text()) == null) {
        termInfo = index_field.get(term.text(), _PAYLOADS)
        for (pos in termInfo) {
            weight_r = pos.payloadAsInt(0)
            norm_r = norm_r + (weight_r * weight_r)
        }
    }
}
*/
denom = Math.sqrt(norm_l * norm_r)
// Guard against division by zero when either vector has no weight
similarity = denom == 0 ? 0 : dot_product / denom
return similarity
The problem I have is that norm_r is incorrect. Since we only loop over the terms of the query, we potentially miss some terms in the document. To calculate the norm of the document correctly, I would need to iterate over all of the field's terms. Is that possible? I tried using _index.termVectors().terms(field_name) but I am getting the following error:
Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface
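To make the norm problem concrete, here is a minimal standalone sketch in plain Java rather than the Groovy script itself. It uses no Elasticsearch APIs; the class name CosineNormDemo, its helper methods, and the term weights are all made up for illustration. It compares the norm accumulated over query terms only (what my loop above does) with the norm over every document term (what I actually need):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CosineNormDemo {

    // Sum of squared weights over every term in the vector.
    static double fullNormSquared(Map<String, Integer> weights) {
        double sum = 0;
        for (int w : weights.values()) {
            sum += (double) w * w;
        }
        return sum;
    }

    // Sum of squared document weights, restricted to terms that also occur
    // in the query. This mirrors what the query loop accumulates into norm_r.
    static double partialNormSquared(Map<String, Integer> query, Map<String, Integer> doc) {
        double sum = 0;
        for (String term : query.keySet()) {
            Integer w = doc.get(term);
            if (w != null) {
                sum += (double) w * w;
            }
        }
        return sum;
    }

    // Dot product over the terms shared by query and document.
    static double dot(Map<String, Integer> query, Map<String, Integer> doc) {
        double sum = 0;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            Integer w = doc.get(e.getKey());
            if (w != null) {
                sum += (double) e.getValue() * w;
            }
        }
        return sum;
    }

    // Cosine similarity from precomputed parts; returns 0 for an empty vector.
    static double cosine(double dotProduct, double normLSquared, double normRSquared) {
        double denom = Math.sqrt(normLSquared * normRSquared);
        return denom == 0 ? 0 : dotProduct / denom;
    }

    public static void main(String[] args) {
        // Hypothetical weights: the document has a term (w3) the query lacks.
        Map<String, Integer> query = new LinkedHashMap<>();
        query.put("w1", 1);
        query.put("w2", 2);
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("w1", 3);
        doc.put("w3", 4);

        double d = dot(query, doc);
        double normL = fullNormSquared(query);
        double partial = partialNormSquared(query, doc);
        double full = fullNormSquared(doc);

        System.out.println("norm_r^2 from query terms only: " + partial);
        System.out.println("norm_r^2 from all doc terms:    " + full);
        System.out.println("similarity with partial norm:   " + cosine(d, normL, partial));
        System.out.println("similarity with full norm:      " + cosine(d, normL, full));
    }
}
```

With these made-up weights, the query-only loop accumulates a squared norm of 9, while the full document norm is 25. The smaller denominator inflates the score whenever the document contains terms that are absent from the query.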
Thanks!