I have a situation where multiple tokens end up at the same position (after a PatternCaptureTokenFilter).
If I had a sentence like this for example:
"blue car_automobile_wheeled_thing road"
and I split on underscore, I'd end up with a token list looking like this:
Token Position Token String
This is akin to the multi-word synonym problem, as described in this brilliant blog post.
Note that 'wheeled thing' is effectively the multi-word synonym.
Lucene, afaik, doesn't use the PositionLengthAttribute, so the bag of tokens at position 1 is unordered.
If I wanted to search for the phrase 'wheeled thing', I'd find it, but if I searched for 'blue thing', I'd find that too, erroneously, because 'thing' is one position away from 'blue'. Has anyone got a solution to this kind of multi-word synonym issue?
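
To make the false positive concrete, here is a minimal sketch that imitates position-based phrase matching outside of Lucene. It is not Lucene code: the token positions (blue at 0, the four captured tokens stacked at 1, road at 2) are my assumption about the token stream, and `build_positions` and `phrase_match` are hypothetical helpers that only model an exact phrase (no slop) over term positions, ignoring position length just as described above.

```python
from collections import defaultdict

# Assumed token stream as (position, term) pairs; positions are illustrative.
TOKENS = [
    (0, "blue"),
    (1, "car"), (1, "automobile"), (1, "wheeled"), (1, "thing"),
    (2, "road"),
]

def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    index = defaultdict(set)
    for pos, term in tokens:
        index[term].add(pos)
    return index

def phrase_match(index, phrase):
    """Exact phrase match: term i must occur at start + i for some start."""
    terms = phrase.split()
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )

index = build_positions(TOKENS)
print(phrase_match(index, "blue thing"))  # 'thing' sits right after 'blue'
print(phrase_match(index, "blue road"))   # 'road' is two positions away
```

Under these assumptions 'blue thing' matches because only positions are compared; nothing records that 'thing' was the second word of a two-word alternative. Note that this simplified model treats tokens stacked at the same position as not adjacent to each other, so it only illustrates the 'blue thing' case.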