Is it possible to access the whole document from a TokenFilter context?


(FAGIM SADYKOV) #1

We are trying to write a TokenFilter that extends the token stream with objects resolved from the text.
Some of the terms are contextual, and their meaning depends on other fields of the document (source, publishdate).
Storing the additional data outside the text is not an option, because highlighting is a requirement.

Is it possible to somehow supply the document currently being indexed to the TokenFilter level?

For now we do it the following way:

  1. first we call _analyze, without storing anything
  2. create a reader on the result of _analyze and manually apply our filters
  3. collect the special tokens from the result and create additional tag fields for the document
  4. store the document in the index
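
The steps above can be sketched roughly as follows. This is only an illustration in plain Java: `TwoPassTagging`, `TAG_PREFIX`, the `tags` field name and the `analyze` function are hypothetical names, not part of Elasticsearch or Lucene. The `analyze` argument stands in for the remote _analyze call plus the manually applied filters, and the final indexing request (step 4) is left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the client-side two-pass workflow; nothing here is an ES or Lucene API. */
class TwoPassTagging {

    /** Assumption: "special" tokens produced by our filters carry this prefix. */
    static final String TAG_PREFIX = "obj:";

    /** Step 3: keep only the special tokens, de-duplicated, in order of first appearance. */
    static List<String> collectTags(List<String> tokens) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith(TAG_PREFIX) && !tags.contains(token)) {
                tags.add(token);
            }
        }
        return tags;
    }

    /**
     * Steps 1-3: analyze the text without storing it, collect the tags, and return
     * a copy of the document with an extra "tags" field. The input map is not modified.
     */
    static Map<String, Object> enrich(Map<String, Object> doc, String textField,
                                      Function<String, List<String>> analyze) {
        List<String> tokens = analyze.apply((String) doc.get(textField));
        Map<String, Object> enriched = new LinkedHashMap<>(doc);
        enriched.put("tags", collectTags(tokens));
        return enriched;
    }
}
```

The enriched map would then be sent to the index in a second request, which is where the doubled analysis and the extra round trips listed below come from.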

Problems:

  1. the analyzer is called twice
  2. three times the client-server traffic
  3. we wrote our analyzer on the Java/Lucene stack so that it can be integrated into ES, but our main client stack is .NET/Mono based, so we need to write wrappers/ports
  4. no bulk indexing
  5. it cannot be used from Sense or any other client

For now we have decided to write a plugin for the REST API and run this chain server side (PUT /OURMODULE/TARGETINDEX/OBJECT/ID), but we think a more flexible way would be to provide some interfaces that allow access to the indexer context from a TokenFilter. I think this is an analyzer-level concern.
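
To make the request concrete, here is a rough sketch of the kind of interface we have in mind. Everything in it is hypothetical: `DocumentContext` and `ContextAwareFilterSketch` do not exist in Lucene or Elasticsearch, the token stream is simplified to a list of strings, and the "president" term is just an invented example of a contextual term.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only: neither type below exists in Lucene or Elasticsearch. */
class ContextAwareFilterSketch {

    /** What the filter would like to receive: a read-only view of the document being indexed. */
    interface DocumentContext {
        Object field(String name);
    }

    /**
     * Passes every token through unchanged and, for a contextual term, appends an
     * extra resolved token built from another field of the same document.
     */
    static List<String> filter(List<String> tokens, DocumentContext ctx) {
        List<String> out = new ArrayList<>();
        Object source = ctx.field("source");
        for (String token : tokens) {
            out.add(token);
            // Invented example: "president" is resolved differently depending on the source.
            if (token.equals("president") && source != null) {
                out.add("obj:president@" + source);
            }
        }
        return out;
    }
}
```

With something like this available at analysis time, the document could be indexed in a single ordinary request and the separate _analyze pass would no longer be needed.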

Maybe this is already possible, but I haven't found any example.


(system) #2