Custom Payload Analyzer and Query


(Kyley Jex) #1

I'm working on providing advanced searching for annotated Medical Documents
(using UIMA). In the context of an annotated document, I identify relevant
medical terms, as well as the negation of certain terms. Following what
I've read and seen in Lucene examples, I've been able to provide a search
that takes into account the metadata contained in the payload. The search,
however, is very primitive: it uses the PayloadSpanUtil to return the
payloads, and then I iterate over the payloads and exclude those that
don't match the criteria. I'm looking for a better way to exclude terms
based on the payload at query time.

I've currently implemented a Custom Analyzer in Lucene and also registered
it in ElasticSearch (kudos -- very easy to integrate). However, with
regard to the searching, I'm not sure how payloads are exposed in ES.

I noticed this post (Token Attributes) in March:
http://elasticsearch-users.115913.n3.nabble.com/Token-attributes-td2648940.html
Has anything changed regarding this availability?

Cheers,
Kyley


(Shay Banon) #2

You will need to register your own query parser that generates the query you use to take the payloads into account. That part is not well documented (and does not have samples), but it's possible.

First, you need to implement a class that implements QueryParser (there are many examples of how to implement one, as it's the interface used for all the different queries in the DSL). Then, you need to register that implementation. Once you do, you have effectively enriched the Query DSL with your own query (under its relevant name).

Registering the query can be done in two ways. You can register it per index, similar to how you register an analyzer: get hold of the IndexQueryParserModule and call addQueryParser on it. Or you can register it globally by having a class that gets injected with IndicesQueriesRegistry and adding the actual implementation there (this is an optimization that registers queries once across all indices rather than per index).


(Kyley Jex) #3

Thanks for the insight. I see now how I can extend the DSL for a new query
type.

However, it's not clear how I can implement a new query that takes into
account the payload AND still return the payload information. In my past
attempts, I was able to create a CustomPayloadTermQuery (overriding
TermSpans.next() to filter out terms based on my payload), but that would
only return the hits and not the payload. The only way I was able to
return the payloads was using PayloadSpanUtil.
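
To illustrate the filtering idea described above, here is a minimal plain-Java sketch with no Lucene dependency. It is not the actual CustomPayloadTermQuery: `Occurrence`, `PayloadFilteringIterator`, `PayloadFilterDemo`, and the one-byte NEGATED flag are hypothetical names and an assumed encoding. It only models skipping occurrences inside next() while keeping the payload attached to each surviving hit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for one term occurrence: document, position, payload bytes.
final class Occurrence {
    final int docId;
    final int position;
    final byte[] payload;

    Occurrence(int docId, int position, byte[] payload) {
        this.docId = docId;
        this.position = position;
        this.payload = payload;
    }
}

// Wraps an occurrence iterator and skips entries whose payload is rejected,
// similar in spirit to filtering inside next() of a spans enumeration.
final class PayloadFilteringIterator implements Iterator<Occurrence> {
    private final Iterator<Occurrence> delegate;
    private final Predicate<byte[]> accept;
    private Occurrence lookahead;

    PayloadFilteringIterator(Iterator<Occurrence> delegate, Predicate<byte[]> accept) {
        this.delegate = delegate;
        this.accept = accept;
        advance();
    }

    // Move to the next occurrence whose payload passes the predicate, if any.
    private void advance() {
        lookahead = null;
        while (delegate.hasNext()) {
            Occurrence candidate = delegate.next();
            if (accept.test(candidate.payload)) {
                lookahead = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    @Override
    public Occurrence next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        Occurrence result = lookahead;
        advance();
        return result;
    }
}

final class PayloadFilterDemo {
    // Assumed encoding: the first payload byte is 1 when the term is negated.
    static final byte NEGATED = 1;

    // Returns the occurrences that are not negated, payloads included.
    static List<Occurrence> survivors(List<Occurrence> all) {
        List<Occurrence> kept = new ArrayList<>();
        Iterator<Occurrence> it = new PayloadFilteringIterator(
                all.iterator(), p -> p.length == 0 || p[0] != NEGATED);
        while (it.hasNext()) {
            kept.add(it.next());
        }
        return kept;
    }
}
```

Because each surviving Occurrence still carries its payload, the caller gets both the hit and the metadata from a single pass. Whether the same can be done inside a real Lucene query is exactly the open question here.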

From what I understand of ES, it seems that after I've registered my own
QueryParser, I would also need to register a new Service (similar to the
SearchService) that would call the query via the PayloadSpanUtil and
return the payloads. Is my understanding correct? And how easy is it to
extend the Services?


(Shay Banon) #4

Getting back the payload information is trickier. There isn't an extension point that allows a plugin to add custom data per hit. You can, however, potentially implement your own Java-based native script and then use that script as a script field. In a script field you have access to the doc id and the reader, so you can fetch whatever relevant data you want.


(Kyley Jex) #5

Is this native script and script field an extension of ES or is it
available in Lucene? Are there any examples of this type of extension?

Can I ask whether my use of payloads is appropriate? My intent is to
provide filtering at the term level based upon metadata available within
the context of each specific term. For example, the UIMA annotation
associates metadata with the text to distinguish between "He has high blood
pressure" vs. "He has a history of high blood pressure" vs. "He does not
have high blood pressure". I require 2 different search capabilities:
match any terms, match only terms not negated. So a search for "high blood
pressure" would not result in a match in the context of "... does not have
..". Is there another way to provide this type of searching?
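
To make the two search modes concrete, here is a small plain-Java sketch of the model I have in mind. It does not use Lucene or ES; `AssertionStatus`, `Mention`, `NegationSearch`, and the one-byte payload encoding are hypothetical and only stand in for what the UIMA annotations and the custom analyzer produce.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed one-byte payload encoding of the annotation attached to a term.
enum AssertionStatus {
    AFFIRMED((byte) 0), HISTORICAL((byte) 1), NEGATED((byte) 2);

    final byte code;

    AssertionStatus(byte code) {
        this.code = code;
    }

    byte[] toPayload() {
        return new byte[] {code};
    }

    static AssertionStatus fromPayload(byte[] payload) {
        if (payload.length == 1) {
            for (AssertionStatus status : values()) {
                if (status.code == payload[0]) {
                    return status;
                }
            }
        }
        throw new IllegalArgumentException("unknown payload");
    }
}

// Hypothetical annotated mention of a medical term in a document.
final class Mention {
    final int docId;
    final String term;
    final byte[] payload;

    Mention(int docId, String term, AssertionStatus status) {
        this.docId = docId;
        this.term = term;
        this.payload = status.toPayload();
    }
}

final class NegationSearch {
    // excludeNegated == false: match any mention of the term.
    // excludeNegated == true: match only mentions that are not negated.
    static List<Integer> search(List<Mention> index, String term, boolean excludeNegated) {
        List<Integer> hits = new ArrayList<>();
        for (Mention mention : index) {
            if (!mention.term.equals(term)) {
                continue;
            }
            boolean negated =
                    AssertionStatus.fromPayload(mention.payload) == AssertionStatus.NEGATED;
            if (excludeNegated && negated) {
                continue;
            }
            if (!hits.contains(mention.docId)) {
                hits.add(mention.docId);
            }
        }
        return hits;
    }
}
```

With one document per example sentence above, the first mode would return all three documents and the second would drop only the "does not have" one.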


(Shay Banon) #6

Scripting is explained here: http://www.elasticsearch.org/guide/reference/modules/scripting.html, with a specific section on native Java scripts.