Data Type to specify exact tokens to index

robmartin11 · August 18, 2019, 8:44am

Is there any way to index content by specifying the exact tokens to index?
For example:

PUT /index/type/1
{
	"content": "Some content to index",
	"tokens": [
		{
			"value": "data",
			"start": 5,
			"end": 11
		}
	]
}

I know the annotated text plugin does something similar, but it's unsuitable because we already know the tokens we want to index and some tokens may overlap.
Would I need to create a custom plugin?

Thanks in advance.

mayya · August 24, 2019, 11:59pm

Currently there is no special field datatype for injecting your own tokens. You would need to develop your own plugin.
I have filed an issue to discuss the possibility of implementing one.

mayya · August 26, 2019, 8:58pm

@robmartin11 We would like to know more about your use case.

First, why don't you do the analysis on the server side and take advantage of the existing analyzers?
Second, if we implemented a new field type that allows injecting tokens, how would you query it? Would you use the keyword analyzer?

robmartin11 · August 30, 2019, 7:33am

Our use case is that we have our own Named Entity Recognition engine, which has already found the tokens in the text. This includes tokens that overlap, representing entities with overlapping words. I expect it would work very much like the annotated text plugin, but rather than marking up the content with the entities (which doesn't allow for overlapping entities, only multiple entities that have exactly the same start/end index), it would return the set of tokens in a separate property as a list. I believe with the annotated text plugin you can still provide your own analyzer.

Mark_Harwood · August 30, 2019, 8:54am

Yes, because injected annotations have to occupy a position in the indexed information dictated by a choice of tokenizer - this is effectively measured as word-number rather than character offset. So the word "foo" is the fourth word in this sentence. The concept of proximity when running positional queries like phrase queries and span queries in Lucene is measured using the token's position info - not by comparing character offsets for proximity. Character offsets are only used for things like highlighting. It can be useful to allow positional queries that mix structured annotations and free-text in proximity queries - demo
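To make the distinction between token positions and character offsets concrete, here is a minimal Python sketch. It uses a plain whitespace split as a stand-in for a real tokenizer, which is an assumption (Lucene analyzers apply their own rules), and the `Token` class and `tokenize` function are hypothetical names rather than part of any Elasticsearch or Lucene API.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int      # word number: what phrase and span queries compare
    start_offset: int  # character offsets: used for things like highlighting
    end_offset: int


def tokenize(text: str) -> list[Token]:
    """Split on whitespace, recording both position and character offsets."""
    return [
        Token(m.group(), position, m.start(), m.end())
        for position, m in enumerate(re.finditer(r"\S+", text))
    ]


tokens = tokenize("So the word foo is the fourth word in this sentence")
foo = next(t for t in tokens if t.term == "foo")
print(foo.position, foo.start_offset, foo.end_offset)
```

With zero-based counting, "foo" gets position 3 (the fourth word) while its character offsets are 12 to 15. A phrase query would compare the positions, not the offsets.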

Annotations using the annotated text plugin can include more than one token to inject - these can be encoded and separated by & characters. Maybe you could put the annotation brackets [ ] around the longest span of any overlapping entities and use the ampersand to separate the two tokens, e.g. (shortEntity&longEntity)?
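As a rough sketch of that suggestion, the helper below builds the markup for one span carrying several annotation values. The function name `annotate` is hypothetical, and the URL-encoding of each value is an assumption based on the "encoded and separated by & characters" description above.

```python
from urllib.parse import quote_plus


def annotate(text: str, values: list[str]) -> str:
    """Wrap `text` in [text](v1&v2...) markup with URL-encoded values."""
    # Assumption: each value is URL-encoded, then the values are joined with "&".
    encoded = "&".join(quote_plus(v) for v in values)
    return f"[{text}]({encoded})"


markup = annotate("some overlapping span", ["shortEntity", "longEntity"])
print(markup)
```

Here the brackets go around the longest span and both entity values are attached to it, so both are injected for the same stretch of text.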

Can you share any examples of your overlapping annotations?

The alternative to using annotated text to inject tokens would be to write your own Analyzer plugin in Java and take full control of converting the text into indexed elements.

robmartin11 · August 30, 2019, 6:57pm

Thanks for the clarification.
I suppose what we are after is something like the annotated text plugin but with the highlighting correct for overlapping entities.
Does the annotated text plugin suffer from Lucene's sausagization issues described in http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html?

Mark_Harwood · August 30, 2019, 9:13pm

The annotations themselves are not tokenised so even if they contain whitespace they are considered as a single token.
The position and length of each of these is dependent on the host Analyzer’s policy for tokenising the text they decorate. Here is the code for injecting the annotations with the position information from the wrapped Analyzer

Mark_Harwood · September 2, 2019, 4:49pm

I put together a multi-value annotation example here

robmartin11 · September 4, 2019, 11:43am

Thanks for the example.
We have also noticed that nesting seems to work correctly, for example: [President of the [United States](USA)](President). This wasn't mentioned in the documentation so just wanted to check if this was done by design and if so if multiple levels of nesting are supported.

Mark_Harwood · September 4, 2019, 7:18pm

No, I don't think it does, and it wasn't designed to either. Note that only the inner USA annotation is parsed. You can see the output using the "analyze" API:

DELETE test
PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text"
      }
    }
  }
}
GET test/_analyze
{
  "field": "text",
  "text": "[President of the [United States](USA)](President)"
}

Note that the outer President annotation is lowercased because it is treated as plain text.
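One way to picture why only the inner annotation is picked up is to assume the markup is matched by a pattern that does not allow square brackets inside the annotated text. The regular expression below is an illustrative assumption for this sketch, not something confirmed in this thread as the plugin's actual parsing code.

```python
import re

# Assumed pattern: "[text](value)" where text contains no square brackets
# and value contains no parentheses. Illustrative only.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")

text = "[President of the [United States](USA)](President)"
matches = ANNOTATION.findall(text)
print(matches)
```

Under that assumption the only match is the inner pair ("United States", "USA"). The outer brackets and "(President)" are left over as ordinary text, which is consistent with the analyze output described above.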

robmartin11 · September 11, 2019, 1:51pm

One solution we are looking at is marking up each word separately so we can support nested entities. For example, if we have the entity 'leg' (ANAT:456) within 'short leg syndrome' (IND:123): [short](IND:123) [leg](IND:123&ANAT:456) [syndrome](IND:123). I understand this may have some downsides, for example skewing relevancy, but are there any other major issues this may cause with the plugin?
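The word-per-token strategy could be generated along these lines. This is a minimal sketch under stated assumptions: words are split on whitespace, entities arrive as (start, end, id) character spans with an exclusive end, and the ids need no URL-encoding. The function name `annotate_words` and the span format are hypothetical, not the output format of any particular NER engine.

```python
import re


def annotate_words(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Mark up each whitespace-delimited word with every entity covering it."""
    out = []
    last = 0
    for m in re.finditer(r"\S+", text):
        # Collect ids of all entity spans that overlap this word.
        ids = [eid for start, end, eid in entities
               if start < m.end() and m.start() < end]
        out.append(text[last:m.start()])  # keep the original whitespace
        out.append(f"[{m.group()}]({'&'.join(ids)})" if ids else m.group())
        last = m.end()
    out.append(text[last:])
    return "".join(out)


text = "short leg syndrome"
entities = [(0, 18, "IND:123"), (6, 9, "ANAT:456")]
print(annotate_words(text, entities))
```

For the example above this produces the same markup as written by hand: [short](IND:123) [leg](IND:123&ANAT:456) [syndrome](IND:123). Words that themselves contain square brackets are not handled in this sketch.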

Mark_Harwood · September 11, 2019, 2:23pm

I don't envisage any issues with the plugin.
I tried your example and the highlighting seems to work OK

Mark_Harwood · September 23, 2019, 3:28pm

Is this approach working for you?

robmartin11 · September 25, 2019, 8:57am

We have started prototyping the next generation of our document search UI, which relies heavily on NER. We have built it on top of ES + the annotated text plugin using the word-per-token strategy. Things are looking good so far, and it seems to deal with overlapping and nested words well. We still need to see how it affects relevancy, but I think in our case this is unlikely to be an issue.

system · October 23, 2019, 8:57am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
