Access to documents in ScriptEngine plugin


(Cameron VandenBerg) #1

Is it possible to access document fields or term vectors within a plugin that implements ScriptEngine? I have a field for each document which contains the length of the document, and I would like to use that value together with the term frequency (tf) to compute a score. I am following this example for writing a plugin: https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-engine.html. However, all my attempts to access the actual document come back null.


(Ryan Ernst) #2

Can you share any of your code?


(Cameron VandenBerg) #3

Here is one example of what I have tried. This is part of my class which implements ScriptEngine, just like in the example. The problem is that the Document returned from the reader (the LeafReader from the context) is null. Is there any way to get a handle on the document within the ScriptEngine context?

```java
@Override
public SearchScript newInstance(LeafReaderContext context) throws IOException {
    LeafReader reader = context.reader();
    PostingsEnum postings = reader.postings(new Term(field, term));

    return new SearchScript(p, lookup, context) {
        int currentDocid = -1;

        @Override
        public void setDocument(int docid) {
            if (postings != null) {
                // advance has undefined behavior calling with a
                // docid <= its current docid
                if (postings.docID() < docid) {
                    try {
                        postings.advance(docid);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            }
            currentDocid = docid;
        }

        @Override
        public double runAsDouble() {
            try {
                double mleScore = 0.0;
                Document doc = reader.document(currentDocid);
                System.out.println("Document: " + doc);
                double doclen = Double.parseDouble(doc.getField("body_len").stringValue());
                System.out.println("Document length: " + doclen);
                if (postings != null) {
                    mleScore = postings.freq() / doclen;
                }

                return mleScore;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    };
}
```


(Ryan Ernst) #4

If you want doc values, the Document is not what you want (that is for stored fields access). You need to get an appropriate doc values instance for the type of data. You can do this using the helper DocValues class from Lucene. For example, to get numeric doc values for a field called "mynum", you would do this next to the postings declaration:

SortedNumericDocValues mynumValues = DocValues.getSortedNumeric(reader, "mynum");
boolean hasMynumValue;

Then in setDocument:

hasMynumValue = mynumValues.advanceExact(docid);

And finally, in the scoring function, when hasMynumValue is true you can use the doc values iterator to extract the values for the current document by calling mynumValues.nextValue() once for each value the doc has (you can find how many values to expect with mynumValues.docValueCount()).
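
Putting the pieces together, the scoring arithmetic could look roughly like the sketch below. It is plain Java so it can be read without the Lucene classes; the names `MleScore` and `score` are made up for illustration, and the comments mark which call is assumed to supply each argument. One detail worth guarding against: `advance(docid)` moves the postings to the first document at or after `docid`, so when the term does not occur in the current document the postings end up on a later one, and the frequency should only be trusted when the two ids match.

```java
/** Hypothetical helper for tf / doclen; not part of Elasticsearch or Lucene. */
final class MleScore {
    private MleScore() {}

    /**
     * @param postingsDocId postings.docID() after setDocument (use -1 when postings is null)
     * @param currentDocId  the docid that was passed to setDocument
     * @param termFreq      postings.freq(); only meaningful when the two ids match
     * @param hasLength     the boolean returned by mynumValues.advanceExact(docid)
     * @param docLength     the first mynumValues.nextValue() for this document
     */
    static double score(int postingsDocId, int currentDocId, int termFreq,
                        boolean hasLength, long docLength) {
        // The term does not occur in this document: postings moved past it.
        if (postingsDocId != currentDocId) {
            return 0.0;
        }
        // No length doc value for this document, or a length that cannot be divided by.
        if (!hasLength || docLength <= 0) {
            return 0.0;
        }
        return (double) termFreq / docLength;
    }
}
```

In runAsDouble you would read `postings.freq()` and `mynumValues.nextValue()` only when the corresponding check passes, then hand the values to a helper like this one.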


(Cameron VandenBerg) #5

Thank you. If the termvector is stored for a field, is there a way to access that in the ScriptEngine as well?


(Ryan Ernst) #6

Yes, you can access term vectors, but you will need to read the Lucene documentation to learn how. An advanced script implemented through a ScriptEngine has a LeafReader, which is an IndexReader and therefore has getTermVector.
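
As a rough sketch of what that lookup could look like, assuming the Lucene 7.x API (with lucene-core on the classpath) and a field that was indexed with term vectors enabled: `getTermVector(docID, field)` returns the `Terms` of that field for one document, or null when no vector was stored. A term vector behaves like a single-document inverted index, so `totalTermFreq()` on it is the term's frequency in that document. The names `TermVectorLookup` and `termFreqFromVector` are hypothetical, and newer Lucene releases may expose term vectors through a different entry point.

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/** Hypothetical helper; a sketch against the Lucene 7.x API, not an official recipe. */
final class TermVectorLookup {
    private TermVectorLookup() {}

    /** Returns how often {@code term} occurs in {@code field} of the given segment-local doc. */
    static long termFreqFromVector(LeafReader reader, int docId, String field, String term)
            throws IOException {
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null) {
            // No term vector stored for this field in this document.
            return 0L;
        }
        TermsEnum terms = vector.iterator();
        if (!terms.seekExact(new BytesRef(term))) {
            // The term does not occur in this document's vector.
            return 0L;
        }
        return terms.totalTermFreq();
    }
}
```

The docId here is the same segment-local id that setDocument receives. Reading a term vector per scored document is considerably more expensive than using postings or doc values, so it is worth measuring before relying on it in a scoring script.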

