How to iterate over field terms in Groovy script?

olalonde · April 13, 2017, 3:36am

I am trying to write a custom scoring script that does a cosine similarity. Here's what I have so far:

field_name = 'visual_words'
dot_product = 0
norm_l = 0
norm_r = 0

// Last option: iterate from 0 to dictionary size
// Dictionary size can be passed as argument to query

// remember which terms we have seen
def doc_terms_seen = [:]

index_field = _index[field_name]

// query_words has the following shape:
// [[word1, word1_weight], [word2, word2_weight],...]
for (pair in query_words) {
  word = pair[0]
  weight_l = pair[1]
  norm_l = norm_l + (weight_l * weight_l)
  // Get the corresponding term in the indexed field
  word_str = word.toString()
  termInfo = index_field.get(word_str, _PAYLOADS)
  for (pos in termInfo) {
    // this iterates only once...
    weight_r = pos.payloadAsInt(0)
    norm_r = norm_r + (weight_r * weight_r)
    dot_product = dot_product + (weight_l * weight_r)
    doc_terms_seen[word_str] = true
  }
}

// Maybe disable security?
// https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-security.html
println _index.termVectors()
// outputs: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields@51d467a8
println _index.termVectors().terms(field_name)
/* Outputs the following error:

Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface

Full stack trace: https://gist.github.com/f057ef61d59c39eaf375bfb7be68bfe9
*/

// How to get doc_terms??
/*
for (term in doc_terms) {
  if (doc_terms_seen.get(term.text()) == null) {
    termInfo = index_field.get(term.text(), _PAYLOADS)
    for (pos in termInfo) {
      weight_r = pos.payloadAsInt(0)
      norm_r = norm_r + (weight_r * weight_r)
    }
  }
}
*/

denom = Math.sqrt(norm_l * norm_r)
similarity = dot_product / denom

return similarity

The problem is that norm_r is incorrect. Because the script only loops over the query's terms, it misses any document terms that do not appear in the query. To calculate the document's norm correctly, I would need to iterate over all of the field's terms. Is that possible? I tried using _index.termVectors().terms(field_name) but I'm getting the following error:
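To make the effect concrete, here is a minimal sketch in plain Java, with no Elasticsearch or Lucene calls. The class and method names are hypothetical, and the maps stand in for the payload weights the script reads from the index. It compares the similarity computed with the document norm over all document terms against the one computed over the query's terms only, as the loop in the script does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; plain maps replace the index lookups.
class PartialNormDemo {

    // Cosine similarity where the document norm covers every document term.
    static double cosineFullNorm(Map<String, Integer> query, Map<String, Integer> doc) {
        double dot = 0, normL = 0, normR = 0;
        for (Map.Entry<String, Integer> q : query.entrySet()) {
            normL += q.getValue() * q.getValue();
            Integer weightR = doc.get(q.getKey());
            if (weightR != null) {
                dot += q.getValue() * weightR;
            }
        }
        for (int weightR : doc.values()) {
            normR += weightR * weightR;
        }
        return dot / Math.sqrt(normL * normR);
    }

    // Same computation, but normR only accumulates over terms that also
    // appear in the query, mirroring the loop in the script.
    static double cosineQueryTermsOnly(Map<String, Integer> query, Map<String, Integer> doc) {
        double dot = 0, normL = 0, normR = 0;
        for (Map.Entry<String, Integer> q : query.entrySet()) {
            normL += q.getValue() * q.getValue();
            Integer weightR = doc.get(q.getKey());
            if (weightR != null) {
                normR += weightR * weightR;
                dot += q.getValue() * weightR;
            }
        }
        return dot / Math.sqrt(normL * normR);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = new HashMap<>();
        query.put("w1", 3);
        query.put("w2", 4);

        Map<String, Integer> doc = new HashMap<>();
        doc.put("w1", 3);
        doc.put("w2", 4);
        doc.put("w3", 12); // term that the query never mentions

        System.out.println(cosineQueryTermsOnly(query, doc)); // perfect match reported
        System.out.println(cosineFullNorm(query, doc));       // 25 / 65, about 0.385
    }
}
```

With these made-up weights, the query-terms-only version reports a similarity of 1, while the full norm gives 5/13, because the weight of w3 never enters norm_r.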

Caused by: java.lang.IllegalAccessException: class is not public: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVFields.terms(String)Terms/invokeVirtual, from org.codehaus.groovy.vmplugin.v7.IndyInterface

Thanks!

olalonde · April 13, 2017, 7:52am

As a temporary workaround, I'm now pre-computing the vector norm, storing it as a field at index time, and accessing it via doc['norm']:

field_name = 'visual_words'
dot_product = 0
norm_l = 0

// remember which terms we have seen
def doc_terms_seen = [:]

// doc['norm'] returns the field's values as a list; [0] takes the first (here, only) value
norm_r = doc['norm'][0]

index_field = _index[field_name]

// query_words has the following shape:
// [[word1, word1_weight], [word2, word2_weight],...]
for (pair in query_words) {
  word = pair[0]
  weight_l = pair[1]
  norm_l = norm_l + (weight_l * weight_l)
  // Get the corresponding term in the indexed field
  word_str = word.toString()
  termInfo = index_field.get(word_str, _PAYLOADS)
  for (pos in termInfo) { // this iterates only once...
    weight_r = pos.payloadAsInt(0)
    // norm_r = norm_r + (weight_r * weight_r)
    dot_product = dot_product + (weight_l * weight_r)
    doc_terms_seen[word_str] = true
  }
}

denom = Math.sqrt(norm_l * norm_r)
similarity = dot_product / denom

return similarity
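
For reference, the index-time side of this workaround could look roughly like the sketch below, written in plain Java with hypothetical names (no Elasticsearch client calls). It assumes the stored value is the sum of squared weights rather than its square root, since the script above takes Math.sqrt(norm_l * norm_r); if the field holds the actual norm, the denominator would need adjusting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; plain maps replace the index lookups.
class StoredNormDemo {

    // Sum of squared term weights, computed once before the document is indexed.
    // Assumption: this is the value written to the 'norm' field.
    static long squaredNorm(Map<String, Integer> termWeights) {
        long sum = 0;
        for (int weight : termWeights.values()) {
            sum += (long) weight * weight;
        }
        return sum;
    }

    // Query-time scoring: only the query's terms are visited, and the
    // document side of the denominator comes from the stored value.
    static double cosine(Map<String, Integer> query, Map<String, Integer> doc, long storedSquaredNorm) {
        double dot = 0, normL = 0;
        for (Map.Entry<String, Integer> q : query.entrySet()) {
            normL += q.getValue() * q.getValue();
            Integer weightR = doc.get(q.getKey());
            if (weightR != null) {
                dot += q.getValue() * weightR;
            }
        }
        return dot / Math.sqrt(normL * storedSquaredNorm);
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("w1", 3);
        doc.put("w2", 4);
        doc.put("w3", 12);
        long stored = squaredNorm(doc); // 9 + 16 + 144 = 169

        Map<String, Integer> query = new HashMap<>();
        query.put("w1", 3);
        query.put("w2", 4);

        System.out.println(stored);
        System.out.println(cosine(query, doc, stored)); // 25 / 65, about 0.385
    }
}
```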

That being said, I'd still be interested to know how to iterate over all terms of a field from a script.

Edit:

I ended up rewriting this as a native plugin and was able to access termVectors.

system · May 11, 2017, 8:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
