How to iterate over a field's terms in a Groovy script?

I am trying to write a custom scoring script that computes the cosine similarity between a query vector and a document field. Here's what I have so far:

field_name = 'visual_words'
dot_product = 0
norm_l = 0
norm_r = 0

// Last option: iterate from 0 to dictionary size
// Dictionary size can be passed as argument to query

// remember which terms we have seen
def doc_terms_seen = [:]

index_field = _index[field_name]

// Query words has following shape:
// [[word1, word1_weight], [word2, word2_weight],...]
for (pair in query_words) {
  word = pair[0]
  weight_l = pair[1]
  norm_l = norm_l + (weight_l * weight_l)
  // Get the corresponding term in the document's field
  word_str = word.toString()
  termInfo = index_field.get(word_str, _PAYLOADS)
  for (pos in termInfo) {
    // this iterates only once...
    weight_r = pos.payloadAsInt(0)
    norm_r = norm_r + (weight_r * weight_r)
    dot_product = dot_product + (weight_l * weight_r)
    doc_terms_seen[word_str] = true
  }
}

// Maybe disable security?
// https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-security.html
println _index.termVectors()
// outputs: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields@51d467a8
println _index.termVectors().terms(field_name)
/* Outputs the following error:

Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface

Full stack trace: https://gist.github.com/f057ef61d59c39eaf375bfb7be68bfe9
*/

// How to get doc_terms??
/*
for (term in doc_terms) {
  if (doc_terms_seen.get(term.text()) == null) {
    termInfo = index_field.get(term.text(), _PAYLOADS)
    for (pos in termInfo) {
      weight_r = pos.payloadAsInt(0)
      norm_r = norm_r + (weight_r * weight_r)
    }
  }
}
*/

denom = Math.sqrt(norm_l * norm_r)
similarity = dot_product / denom

return similarity

The problem is that norm_r is incorrect. Because the loop only visits the terms of the query, any document term that does not appear in the query never contributes to norm_r, so the document norm is underestimated. To compute the document norm correctly, I would need to iterate over all of the field's terms. Is that possible? I tried using _index.termVectors().terms(field_name), but I get the following error:

Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface
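
To make the norm problem concrete, here is a small standalone sketch in Python (not an Elasticsearch script). It assumes both vectors are available as plain term-to-weight dictionaries; the function names and sample weights are made up for illustration. The first function mimics the script above and only sums document weights for terms that occur in the query. The second sums over every document term.

```python
import math

def cosine_query_terms_only(query, doc):
    """Mimics the script: the document norm only covers terms found in the query."""
    dot = sum(w * doc[t] for t, w in query.items() if t in doc)
    norm_l = sum(w * w for w in query.values())
    norm_r = sum(doc[t] ** 2 for t in query if t in doc)
    denom = math.sqrt(norm_l * norm_r)
    return dot / denom if denom else 0.0

def cosine_full(query, doc):
    """Correct version: the document norm covers every term in the document."""
    dot = sum(w * doc[t] for t, w in query.items() if t in doc)
    norm_l = sum(w * w for w in query.values())
    norm_r = sum(w * w for w in doc.values())
    denom = math.sqrt(norm_l * norm_r)
    return dot / denom if denom else 0.0

# Hypothetical weights: the document has a heavy term "c" that the query lacks.
query = {"a": 1, "b": 1}
doc = {"a": 1, "b": 1, "c": 2}

partial = cosine_query_terms_only(query, doc)
full = cosine_full(query, doc)
print(partial, full)
```

With these sample weights the query-terms-only version scores the document as a perfect match, because the term "c" never enters the norm, while the full version scores it lower.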

Thanks!

As a temporary workaround, I'm now pre-computing the vector norm, storing it as a field at index time, and accessing it via doc['norm']:

field_name = 'visual_words'
dot_product = 0
norm_l = 0

// remember which terms we have seen
def doc_terms_seen = [:]

// doc['norm'] returns a list of values (fields can be multi-valued), hence index 0
norm_r = doc['norm'][0]

index_field = _index[field_name]

// Query words has following shape:
// [[word1, word1_weight], [word2, word2_weight],...]
for (pair in query_words) {
  word = pair[0]
  weight_l = pair[1]
  norm_l = norm_l + (weight_l * weight_l)
  // Get the corresponding term in the document's field
  word_str = word.toString()
  termInfo = index_field.get(word_str, _PAYLOADS)
  for (pos in termInfo) { // this iterates only once...
    weight_r = pos.payloadAsInt(0)
    // norm_r = norm_r + (weight_r * weight_r)
    dot_product = dot_product + (weight_l * weight_r)
    doc_terms_seen[word_str] = true
  }
}

denom = Math.sqrt(norm_l * norm_r)
similarity = dot_product / denom

return similarity
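
For reference, the index-time side of this workaround could look roughly like the Python sketch below. It is an illustration only, not Elasticsearch code: it assumes the document's weights are known as a term-to-weight dictionary before indexing, and the helper names `with_squared_norm` and `score` are hypothetical. It stores the squared norm (the sum of squared weights), which is what the script needs for `Math.sqrt(norm_l * norm_r)` to be the product of the two vector lengths. If the stored value were already the square root, the denominator would have to be `Math.sqrt(norm_l) * norm_r` instead.

```python
import math

def with_squared_norm(doc_weights):
    """Build the document to index, adding the precomputed squared norm."""
    squared = sum(w * w for w in doc_weights.values())
    return {"visual_words": doc_weights, "norm": squared}

def score(query_words, indexed):
    """Same arithmetic as the script, reading the norm from the stored field."""
    dot_product = 0
    norm_l = 0
    norm_r = indexed["norm"]
    for word, weight_l in query_words:
        norm_l += weight_l * weight_l
        weight_r = indexed["visual_words"].get(str(word))
        if weight_r is not None:
            dot_product += weight_l * weight_r
    denom = math.sqrt(norm_l * norm_r)
    return dot_product / denom if denom else 0.0

# Hypothetical visual-word ids and weights.
indexed = with_squared_norm({"12": 3, "40": 4})
same_vector = score([[12, 3], [40, 4]], indexed)
one_term = score([[12, 1]], indexed)
print(indexed["norm"], same_vector, one_term)
```

The norm has to be recomputed whenever the document's weights change, since the script trusts the stored value.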

That being said, I'd still be interested to know how to iterate over all terms of a field from a script.

Edit:

I ended up rewriting this as a native plugin and was able to access termVectors.
