Index pre-analyzed text by sending the actual terms/tokens?

Is it possible to somehow pass on an already pre-analyzed stream of terms/tokens to an index for a field?

I already have my text broken up into tokens, I have the offsets, I can produce everything an analyzer could produce, but better. Is there a way to pass this on to the index somehow?

Hi Johann,
We don't have a raw token-stream format (e.g. JSON) for tokens, but we do allow "clever" external text processors to add arbitrary "annotations" that overlay the tokens Elasticsearch produces.
See the annotated text plugin.

Thank you for the info, though it is a bit disappointing, because just adding the tokens I already have would be zero effort. Having had a look at the annotated text plugin, it seems it would be a huge effort to understand how to implement my own plugin that does what I want and to integrate it correctly.

It would really be a huge help if there was a plugin that essentially supported something like a "rawtoken" type of field, which could then contain an array of tokens similar to what the annotated text plugin produces.
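To make the idea concrete, here is a hypothetical sketch of what such a pre-analyzed payload could look like. Nothing here is an existing Elasticsearch feature: the field name `content_rawtokens` and the attribute set are invented for illustration, loosely mirroring the attributes a Lucene analyzer would normally emit per token.

```python
# Hypothetical sketch: the per-token attributes a pre-analyzed field would
# need. This is NOT an existing Elasticsearch API; it only illustrates the
# shape of the data an external tokenizer could hand over.
import json

def make_token(term, start, end, position, position_length=1):
    """Bundle one token with the attributes an analyzer would normally emit."""
    return {
        "token": term,
        "start_offset": start,
        "end_offset": end,
        "position": position,
        "position_length": position_length,  # >1 lets a token span several positions
    }

text = "BRCA1 protein binds DNA"
tokens = [
    make_token("BRCA1", 0, 5, 0),
    make_token("protein", 6, 13, 1),
    make_token("binds", 14, 19, 2),
    make_token("DNA", 20, 23, 3),
]

# The document sent to the (hypothetical) "rawtoken" field.
payload = json.dumps({"content_rawtokens": tokens})
```

As the earlier reply notes, JSON is a fairly verbose carrier for this; the sketch just pins down which attributes the external process would have to supply.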

Are there resources, documentation or help for how to implement such a plugin?

I imagine JSON would be a pretty verbose way of passing a tokenized stream, especially with the offset and position information.
You then have to worry about Elasticsearch functions that rely on re-tokenizing from the original text - they assume the analysis functions are "in the box" and not something that requires consulting external processes.
The most obvious need for an analyzer is in tokenizing search terms the user provides but others include some highlighters and the significant_text aggregation.
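The search-term point can be illustrated with a toy sketch: if the query terms are not run through the same (or a compatible) analysis as the indexed tokens, nothing lines up. The tokenizer below is a deliberately simple stand-in, not any real Elasticsearch analyzer.

```python
# Toy illustration: index-time and query-time tokenization must agree.
def tokenize(text):
    """Stand-in analyzer: lowercase whitespace splitting."""
    return text.lower().split()

# Tokens as they would sit in the inverted index.
indexed_tokens = set(tokenize("The Death of Stalin was released in 2017"))

# A query analyzed the same way lines up with the index...
assert "stalin" in set(tokenize("STALIN"))
assert "stalin" in indexed_tokens

# ...but an un-analyzed query term misses, even though the text matches.
assert "STALIN" not in indexed_tokens
```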

It would be easier to add a Java Analyzer that reproduces the functionality you need.
Out of interest - what functionality do you feel is missing that requires external processing?

This is about two separate issues really:

  1. which process does the tokenization of documents during the processing of the data

I am working on NLP, so I already have a lot of code and infrastructure for tokenization, entity recognition, linking, disambiguation etc. This code is often implemented in non-Java languages like Python, e.g. the output of a neural net. The processing is done in a completely separate job from the one that actually stores the data in ES.

  2. what should the token stream look like for some text

I want to be able to use the tokens from my processing in 1), but also to add arbitrary overlapping token sequences - similar to what the annotated text plugin does, but for more realistic NLP annotations, where e.g. a protein name and a gene name can overlap in various ways, or a movie title and a person name could overlap, etc. The annotated text plugin is not flexible enough for this.
Since I want to be able to query for these entities by position, as is possible with the annotated text plugin, I cannot just use separate keyword fields for that.
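To illustrate the kind of overlap meant here, annotations could be kept as standoff spans over the same text, along these lines (the schema is invented for illustration, not any plugin's format):

```python
# Standoff annotations over one text: spans may overlap freely, which
# inline markup such as the annotated-text syntax struggles to express.
text = "BRCA1 gene product"

annotations = [
    {"label": "Gene",    "start": 0, "end": 5},   # "BRCA1"
    {"label": "Protein", "start": 0, "end": 18},  # "BRCA1 gene product"
]

def overlaps(a, b):
    """Two half-open spans [start, end) overlap iff neither ends before the other starts."""
    return a["start"] < b["end"] and b["start"] < a["end"]

# The gene and protein mentions overlap but have different extents.
assert overlaps(annotations[0], annotations[1])
```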

For the query phase it would be much easier to create a normal custom analyzer to tokenize the query. So even for a field where I would only send the token stream at index time, it would still be useful to be able to specify a custom analyzer for the query phase.
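For what it's worth, regular text fields already support this split via the real `search_analyzer` mapping parameter; a hypothetical pre-analyzed field type could follow the same convention. The mapping below uses only real field types as a reference point:

```python
# Real Elasticsearch mappings already separate index-time and search-time
# analysis via "analyzer" vs. "search_analyzer" on text fields. A pre-analyzed
# field type could reuse the same convention for its query side.
import json

mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",                  # real field type
                "analyzer": "standard",          # used when indexing the field
                "search_analyzer": "standard",   # used when analyzing queries
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```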

Highlighting is not too important for this use case, as all querying and processing of the query results would be done by a program, which could then handle the results accordingly.

But anyway: how would one best get started understanding everything that is necessary to implement an analyzer? Is there a tutorial or similar for that somewhere?

A piece of text can have multiple annotations, e.g.
[Death of Stalin](Movie&Person&Event) was released...
If you're able to resolve overlapping tokens to the longest text span, you can bundle the annotation types as above to describe that span. Not ideal, but anything else complicates the markup syntax.
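A minimal sketch of reading that bundled markup back into annotations (the regex and function are ad hoc for illustration, not the annotated-text plugin's actual parser):

```python
import re

# Ad-hoc reader for the bundled "[text](TypeA&TypeB)" markup shown above;
# this is not the annotated-text plugin's implementation.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse(marked_up):
    """Return (span_text, [types]) pairs for each annotation in the markup."""
    return [(m.group(1), m.group(2).split("&")) for m in ANNOTATION.finditer(marked_up)]

result = parse("[Death of Stalin](Movie&Person&Event) was released...")
# -> [("Death of Stalin", ["Movie", "Person", "Event"])]
```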

I suspect the easiest way to get started is to copy the source code of one of the existing analysis plugins as a template.

