Index pre-analyzed text by sending the actual terms/tokens?

johann-petrak · November 10, 2020, 11:03pm

Is it possible to somehow pass on an already pre-analyzed stream of terms/tokens to an index for a field?

I already have my text broken up into tokens, I have the offsets, I can produce everything an analyzer could produce, but better. Is there a way to pass this on to the index somehow?

Mark_Harwood · November 12, 2020, 11:23am

Hi Johann,
We don't have a raw token stream input (e.g. a JSON format for tokens), but we do allow for "clever" external text processors to add arbitrary "annotations" that overlay on the tokens Elasticsearch produces.
See the annotated text plugin.
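
For illustration, a minimal sketch of what using the annotated text plugin might look like from the client side. The field name `body` and the sample sentence are assumptions made up for this example; the `annotated_text` field type and the `[covered text](annotation value)` markup are what the plugin provides.

```python
import json

# Index body declaring one field of the plugin's annotated_text type.
# The field name "body" is illustrative only.
mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "annotated_text"}
        }
    }
}

# A document whose text carries an inline annotation. The annotation value
# ("Person") is indexed as an extra token overlaying the covered text.
document = {
    "body": "[Stalin](Person) died in 1953."
}

# These dicts would be sent as the JSON bodies of the index-creation and
# document-indexing requests, respectively.
print(json.dumps(mapping, indent=2))
print(json.dumps(document, indent=2))
```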

johann-petrak · November 12, 2020, 11:42am

Thank you for that info. It is a bit disappointing, because just adding the tokens I already have would be zero effort. When I had a look at the annotated text plugin, it looked like it would be a huge effort to understand how to implement my own plugin for doing what I want and to integrate it correctly.

It would really be a huge help if there was a plugin that would essentially support something like a "rawtoken" type of field which could then contain an array of tokens as the annotated text plugin produces.

Are there resources, documentation or help for how to implement such a plugin?

Mark_Harwood · November 12, 2020, 11:52am

I imagine JSON would be a pretty verbose way of passing a tokenized stream, especially with the offset and position information.
You then have to worry about Elasticsearch functions that rely on re-tokenizing from the original text - they assume the analysis functions are "in the box" and not something you have to consult external processes for.
The most obvious need for an analyzer is in tokenizing the search terms the user provides, but others include some highlighters and the significant_text aggregation.
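
To make the verbosity point concrete, here is a rough sketch that serializes a token stream with positions and offsets as JSON and compares its size with the raw text. The record layout (`term`, `position`, `start_offset`, `end_offset`) is purely hypothetical; Elasticsearch does not accept such an input format.

```python
import json

text = "Death of Stalin was released"

# Build one hypothetical record per whitespace-separated token.
tokens = []
cursor = 0
for position, word in enumerate(text.split()):
    start = text.index(word, cursor)  # character offset of this token
    end = start + len(word)
    tokens.append({
        "term": word.lower(),
        "position": position,
        "start_offset": start,
        "end_offset": end,
    })
    cursor = end

payload = json.dumps(tokens)

# Compare the size of the serialized token stream with the original text.
print(len(text), len(payload))
```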

It would be easier adding a Java Analyzer that reproduces the functionality you need.
Out of interest - what functionality do you feel is missing that requires external processing?

johann-petrak · November 12, 2020, 1:46pm

This is about two separate issues really:

1) Which process tokenizes the documents, and at what point in the data processing

I am working on NLP, so I already have a lot of code and infrastructure for tokenization, entity recognition, linking, disambiguation etc. This code is often implemented in non-Java languages like Python (e.g. producing the output of a neural net). The processing is done in a completely separate job from the one that actually stores the data in ES.

2) What the token stream should look like for some text

I want to be able to use the tokens from my processing in 1), but also to add arbitrary overlapping token sequences. This is similar to what the annotated text plugin does, but for more realistic NLP annotations, where e.g. a protein name and a gene name can overlap in various ways, or a movie title and a person name can overlap. The annotated text plugin is not flexible enough for this.
Since I want to be able to query for these entities by position, as is possible with the annotated text plugin, I cannot just use separate keyword fields for that.

For the query phase, it would be much easier to create a normal custom analyzer to tokenize the query, so for such a field where I would like to be able to send just the token stream, it would still be useful to be able to specify a custom analyzer for the query phase.
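
As a rough sketch of the "separate analyzer for the query phase" idea using only stock features: a regular `text` field can name one analyzer for indexing and another via `search_analyzer`. This is not the raw-token field asked about above. It is a workaround that assumes tokens contain no whitespace and are already lowercased, and it does not preserve the original offsets. The field name `tokens`, the analyzer name `query_side`, and the helper `to_field_value` are made up for this example.

```python
# Index body: whitespace analysis at index time (tokens arrive pre-split),
# and a custom analyzer applied to user queries at search time.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "query_side": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "tokens": {
                "type": "text",
                "analyzer": "whitespace",
                "search_analyzer": "query_side",
            }
        }
    },
}


def to_field_value(tokens):
    """Join externally produced tokens so the whitespace analyzer
    splits them back into exactly the same terms."""
    for token in tokens:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"token cannot be indexed as-is: {token!r}")
    return " ".join(tokens)


print(to_field_value(["death", "of", "stalin"]))
```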

Highlighting is not too important for this use case, as all querying and processing of the query results would be done by a program, which could then process the results accordingly.

But anyway: how would one best get started understanding everything that is necessary to implement an analyzer? Is there a tutorial or similar for that somewhere?

Mark_Harwood · November 12, 2020, 2:17pm

A piece of text can have multiple annotations, e.g.:
[Death of Stalin](Movie&Person&Event) was released...
If you're able to resolve overlapping tokens to the longest text span you can bundle them as above to describe the text span. Not ideal but anything else complicates the markup syntax.
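
The bundling described above could be sketched roughly as follows. This is one possible interpretation, not the plugin's own code: each group of overlapping annotations is widened to the span covering all of them, and their labels are joined with `&`. The function name `bundle_annotations` and the `(start, end, label)` tuple layout are assumptions for this example, and square brackets inside the text itself are not handled.

```python
from urllib.parse import quote_plus


def bundle_annotations(text, annotations):
    """Render (start, end, label) annotations as annotated-text markup,
    merging overlapping spans into one covering span with bundled labels."""
    groups = []  # each entry: [start, end, [labels]]
    # Sort by start offset, longest span first among equal starts.
    for start, end, label in sorted(annotations, key=lambda a: (a[0], -a[1])):
        if groups and start < groups[-1][1]:
            # Overlaps the current group: widen it and add the label once.
            groups[-1][1] = max(groups[-1][1], end)
            if label not in groups[-1][2]:
                groups[-1][2].append(label)
        else:
            groups.append([start, end, [label]])

    parts, cursor = [], 0
    for start, end, labels in groups:
        parts.append(text[cursor:start])  # plain text before the span
        values = "&".join(quote_plus(label) for label in labels)
        parts.append(f"[{text[start:end]}]({values})")
        cursor = end
    parts.append(text[cursor:])  # trailing plain text
    return "".join(parts)


text = "Death of Stalin was released"
annotations = [(0, 15, "Movie"), (9, 15, "Person"), (0, 15, "Event")]
print(bundle_annotations(text, annotations))
```

Note that merging this way loses the information that "Person" applied only to "Stalin", which is the trade-off mentioned above.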

I suspect the easiest thing to do is to copy the source code of one of the existing Analysis plugins as a template.

system · December 10, 2020, 2:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
