Is there any way to index content by specifying the exact tokens to index?
For example:
PUT /index/type/1
{
  "content": "Some content to index",
  "tokens": [
    {
      "value": "data",
      "start": 5,
      "end": 11
    }
  ]
}
I know the annotated text plugin does something similar, but it's unsuitable because we already know the tokens we want to index, and some tokens may overlap.
Would I need to create a custom plugin?
Currently there is not a special field datatype for injecting your own tokens. You would need to develop your own plugin.
I have filed an issue to discuss a possibility of implementing it.
So the use case is that we have our own Named Entity Recognition engine which has already found the tokens in the text, including tokens that overlap to represent entities with overlapping words. I expect it would work very much like the annotated text plugin, but rather than marking up the content with the entities (which doesn't allow for overlapping entities, only multiple entities that have exactly the same start/end index), it would accept the set of tokens in a separate property as a list. I believe with the annotated text plugin you can still provide your own analyzer?
Yes, because injected annotations have to occupy a position in the indexed information dictated by a choice of tokenizer - this is effectively measured as word-number rather than character offset. So the word "foo" is the fourth word in this sentence. The concept of proximity when running positional queries like phrase queries and span queries in Lucene is measured using the token's position info - not by comparing character offsets for proximity. Character offsets are only used for things like highlighting. It can be useful to allow positional queries that mix structured annotations and free-text in proximity queries - demo
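The distinction above between word positions and character offsets can be illustrated with a minimal sketch, assuming simple whitespace tokenization (real analyzers may differ):

```python
# Minimal sketch: token positions (word numbers) vs character offsets.
# Lucene measures proximity for phrase/span queries by position;
# offsets are only used for things like highlighting.
def tokenize(text):
    tokens = []
    pos = 0      # word number (zero-based position)
    offset = 0   # character cursor into the text
    for word in text.split():
        start = text.index(word, offset)
        tokens.append({"token": word, "position": pos,
                       "start_offset": start, "end_offset": start + len(word)})
        offset = start + len(word)
        pos += 1
    return tokens

for t in tokenize("So the word foo is the fourth word"):
    print(t)
# "foo" comes out at position 3 (the fourth word), with offsets 12-15
```

An injected annotation must be assigned one of these positions, which is why the choice of tokenizer dictates where it lands.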
Annotations using the annotated text plugin can include more than one token to inject - these can be encoded and separated by & characters. Maybe you could put the annotation braces [ ] around the longest span of any overlapping entities and use the ampersand to separate the two tokens, e.g. (shortEntity&longEntity)?
Can you share any examples of your overlapping annotations?
The alternative to using annotated text to inject tokens would be to write your own Analyzer plugin in Java and take full control of converting the text into indexed elements.
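The "longest span plus ampersand" suggestion above can be sketched as a small helper. This is a hypothetical illustration, not part of the plugin; the `merge_overlapping` function and the entity dict shape are assumptions:

```python
# Hypothetical sketch: given overlapping entity spans over one region of text,
# wrap the longest covering span in annotated-text braces and join all the
# entity values with '&' so every token is injected at that span.
def merge_overlapping(text, entities):
    # entities: dicts with character "start"/"end" and an annotation "value"
    start = min(e["start"] for e in entities)
    end = max(e["end"] for e in entities)
    values = "&".join(e["value"] for e in entities)
    return text[:start] + "[" + text[start:end] + "](" + values + ")" + text[end:]

text = "short leg syndrome"
entities = [
    {"start": 0, "end": 18, "value": "IND:123"},   # whole phrase
    {"start": 6, "end": 9,  "value": "ANAT:456"},  # just "leg"
]
print(merge_overlapping(text, entities))
# → [short leg syndrome](IND:123&ANAT:456)
```

One trade-off with this approach is that the narrower entity loses its precise character span, since both tokens are anchored to the longest span.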
Thanks for the clarification.
I suppose what we are after is something like the annotated text plugin but with the highlighting correct for overlapping entities.
Does the annotated text plugin suffer from Lucene's sausagization issues described in http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html ?
The annotations themselves are not tokenised, so even if they contain whitespace they are considered as a single token.
The position and length of each of these is dependent on the host Analyzer's policy for tokenising the text they decorate. Here is the code for injecting the annotations with the position information from the wrapped Analyzer.
Thanks for the example.
We have also noticed that nesting seems to work correctly, for example: [President of the [United States](USA)](President). This wasn't mentioned in the documentation, so we just wanted to check whether this was done by design and, if so, whether multiple levels of nesting are supported.
No, I don't think it does and wasn't designed to either. Note only the inner USA annotation is parsed. You can see the outputs using the "analyze" api:
DELETE test

PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text"
      }
    }
  }
}

GET test/_analyze
{
  "field": "text",
  "text": "[President of the [United States](USA)](President)"
}
Note the outer President annotation is lower cased because it is considered to be text.
One solution we are looking at is marking up each word separately so we can support nested entities. For example, if we have the entity 'leg' (ANAT:456) within 'short leg syndrome' (IND:123): [short](IND:123) [leg](IND:123&ANAT:456) [syndrome](IND:123). I understand this may have some downsides, for example skewing relevancy, but are there any other major issues this may cause with the plugin?
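The word-per-token strategy above can be generated mechanically from character-offset entity spans. This is a hypothetical sketch; `per_word_markup` and the `(start, end, id)` tuple shape are assumptions, not plugin API:

```python
# Hypothetical sketch of the word-per-token strategy: annotate each word
# separately, joining the IDs of every entity whose span covers it with '&'.
import re

def per_word_markup(text, entities):
    # entities: list of (start, end, id) character spans over `text`
    out = []
    for m in re.finditer(r"\S+", text):
        ids = [eid for (s, e, eid) in entities
               if s <= m.start() and m.end() <= e]
        word = m.group()
        out.append("[%s](%s)" % (word, "&".join(ids)) if ids else word)
    return " ".join(out)

text = "short leg syndrome"
entities = [(0, 18, "IND:123"), (6, 9, "ANAT:456")]
print(per_word_markup(text, entities))
# → [short](IND:123) [leg](IND:123&ANAT:456) [syndrome](IND:123)
```

Because every annotated word repeats the entity ID, this does inject the same token once per word of the entity, which is the likely source of the relevancy skew mentioned above.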
We have started prototyping the next generation of our document search UI, which relies heavily on NER. We have built it on top of ES + the annotated text plugin using the word-per-token strategy. Things are looking good so far and it seems to deal with overlapping and nested words well. We still need to see how it affects relevancy, but I think in our case this is unlikely to be an issue.