Yes, because injected annotations have to occupy a position in the index, and that position is dictated by the choice of tokenizer. It is effectively measured as a word number rather than a character offset. So the word "foo" is the fourth word in this sentence. Proximity in positional queries such as phrase queries and span queries in Lucene is measured using the tokens' position info, not by comparing character offsets. Character offsets are only used for things like highlighting. It can be useful to allow positional queries that mix structured annotations and free text in proximity queries - demo
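To make the position/offset distinction concrete, here is a rough sketch in plain Python. It is not Lucene code: the `tokenize` helper and its regex are my own stand-ins for a real tokenizer, and I number positions from 1 just to match the "fourth word" wording above.

```python
import re

def tokenize(text):
    """Hypothetical stand-in for an analyzer: returns
    (position, token, start_offset, end_offset) for each word."""
    return [
        (position, match.group().lower(), match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\w+", text), start=1)
    ]

sentence = 'So the word "foo" is the fourth word in this sentence.'
tokens = tokenize(sentence)

# Each token carries both a word position and character offsets.
for position, token, start, end in tokens:
    print(position, token, start, end)

# "foo" is found at word position 4; its character offsets are
# recorded separately and would only matter for highlighting.
foo = next(t for t in tokens if t[1] == "foo")
```

A phrase or span query would compare the first number in each tuple, not the last two.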
Annotations using the annotated text plugin can include more than one token to inject; the values are encoded and separated by & characters. Maybe you could put the annotation brackets [ ] around the longest span of any overlapping entities and use the ampersand to separate the two tokens.
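Something along these lines might work as a pre-processing step before indexing. This is only a sketch: the `annotate` helper, the sample text and the entity values are made up for illustration, I am assuming your entities arrive as (start, end, value) character spans, and it does not escape any brackets that already appear in the text.

```python
from urllib.parse import quote_plus

def annotate(text, entities):
    """Wrap each group of overlapping entity spans in one annotation.

    entities: list of (start, end, value) character spans (assumed input shape).
    Overlapping spans are merged into the longest covering span and their
    URL-encoded values are joined with '&'.
    """
    groups = []
    for start, end, value in sorted(entities):
        if groups and start < groups[-1][1]:
            # Overlaps the previous group: extend it and add the value.
            groups[-1][1] = max(groups[-1][1], end)
            groups[-1][2].append(value)
        else:
            groups.append([start, end, [value]])

    pieces, cursor = [], 0
    for start, end, values in groups:
        pieces.append(text[cursor:start])
        encoded = "&".join(quote_plus(v) for v in values)
        pieces.append(f"[{text[start:end]}]({encoded})")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

# Hypothetical overlapping entities: "Jeff Beck" and "Beck".
marked_up = annotate(
    "Jeff Beck plays guitar",
    [(0, 9, "Jeff Beck"), (5, 9, "Beck")],
)
print(marked_up)
```

The trade-off is that the shorter entity gets anchored to the longer span, so it no longer marks exactly the words it originally covered.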
Can you share any examples of your overlapping annotations?
The alternative to using annotated text to inject tokens would be to write your own Analyzer plugin in Java and take full control of converting the text into indexed elements.