Is it possible to access the whole document from a TokenFilter context?

comdiv · May 9, 2015, 7:48am

We are trying to write a TokenFilter that extends the token stream with objects resolved from the text.
Some terms are contextual, and their meaning depends on fields of the document (source, publishdate).
Storing this extra data outside the text is not an option, because highlighting is a requirement.

Is there some way to pass the document currently being indexed down to the TokenFilter level?

For now we do it the following way:

1. Call _analyze first, without storing anything.
2. Create a reader over the result of _analyze and manually apply our filters.
3. Collect the special tokens from the result and create extra tag fields for the document.
4. Store the document in the index.
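
To make step 3 concrete, here is a minimal sketch in plain Java of collecting the special tokens into tag fields. The class names, the token shape, and the "tag:" type prefix are all assumptions made for illustration; this is not our actual code and uses no Lucene or Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical token shape: the term text plus a type label, similar in
// spirit to the "token" and "type" values returned by _analyze.
final class AnalyzedToken {
    final String term;
    final String type;

    AnalyzedToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

final class TagCollector {
    // Assumption: resolved objects carry a type starting with "tag:",
    // for example "tag:person". Everything else is an ordinary token.
    private static final String PREFIX = "tag:";

    // Groups the terms of all special tokens by field name, keeping the
    // order in which the fields first appear in the token stream.
    static Map<String, List<String>> collect(List<AnalyzedToken> tokens) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (AnalyzedToken t : tokens) {
            if (t.type.startsWith(PREFIX)) {
                String field = t.type.substring(PREFIX.length());
                fields.computeIfAbsent(field, k -> new ArrayList<>()).add(t.term);
            }
        }
        return fields;
    }
}
```

The resulting map would then be added to the document as extra fields before it is stored in step 4.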

Problems:

- The analyzer is called twice.
- Client-server traffic is tripled.
- We wrote our analyzer on the Java/Lucene stack so that it can be integrated into ES, but our main client stack is .NET/Mono based, so we need to write wrappers or ports.
- Bulk requests cannot be used.
- It cannot be used from Sense or any other client.

For now we have decided to write a plugin for the REST API and run this chain on the server side (PUT /OURMODULE/TARGETINDEX/OBJECT/ID), but we think a more flexible approach would be to provide some interfaces that give a TokenFilter access to the indexer context. I think this is an analyzer-level concern.
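
To illustrate the kind of interface we have in mind, here is a minimal sketch in plain Java. It models the idea with a thread-local holder that the indexing code would fill before analysis and that the filter reads. Every name here is hypothetical; nothing in it is an existing Lucene or Elasticsearch API, and the filter is a simple stand-in for a real TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the fields of the document currently being indexed.
// Assumption: the indexing code sets it before analysis and clears it after.
final class DocumentContext {
    private static final ThreadLocal<Map<String, String>> CURRENT = new ThreadLocal<>();

    static void set(Map<String, String> fields) { CURRENT.set(fields); }

    static Map<String, String> get() { return CURRENT.get(); }

    static void clear() { CURRENT.remove(); }
}

// Stand-in for a TokenFilter: passes every token through and, for a
// contextual term, appends an extra token resolved from a document field.
final class ContextualFilter {
    static List<String> filter(List<String> tokens) {
        Map<String, String> doc = DocumentContext.get();
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            // "today" only has a meaning relative to the document's publishdate.
            if ("today".equals(token) && doc != null && doc.containsKey("publishdate")) {
                out.add("date:" + doc.get("publishdate"));
            }
        }
        return out;
    }
}
```

With publishdate set to 2015-05-09 in the context, filtering the tokens "published", "today" would yield "published", "today", "date:2015-05-09"; without a context the tokens pass through unchanged.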

Maybe this is already possible, but I haven't found any example.
