I have a situation where multiple tokens end up at the same position (after a PatternCaptureTokenFilter).
If I had a sentence like this for example:
"blue car_automobile_wheeled_thing road"
and I split on underscore, I'd end up with a token list looking like this:
Token Position   Token String
0                blue
1                car
1                automobile
1                wheeled
1                thing
2                road
This is akin to the multi-word synonym problem as described in this brilliant blog post.
Note that 'wheeled thing' is effectively the multi-word synonym here.
Lucene, afaik, doesn't use the PositionLengthAttribute, so the bag of tokens at position 1 is unordered.
If I wanted to search for the phrase 'wheeled thing', I'd find it, but if I searched for 'blue thing', I'd find that too, erroneously, because 'thing' is one position away from 'blue'. Has anyone got a solution to this kind of multi-word synonym issue?
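To make the 'blue thing' false positive concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class StackedTokenPhraseDemo and its index and matchesPhrase methods are hypothetical names, and the matcher only models the assumption that an exact phrase query requires each term to sit one position after the previous one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a positional index; not part of Lucene.
public class StackedTokenPhraseDemo {

    // Build term -> positions. tokensByPosition[p] holds every token stacked at position p.
    static Map<String, List<Integer>> index(String[][] tokensByPosition) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokensByPosition.length; pos++) {
            for (String term : tokensByPosition[pos]) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
            }
        }
        return postings;
    }

    // Assumed exact-phrase rule: term i must occur at (position of term 0) + i.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String... terms) {
        List<Integer> starts = postings.getOrDefault(terms[0], Collections.emptyList());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                List<Integer> positions = postings.getOrDefault(terms[i], Collections.emptyList());
                ok = positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The token list from above: four tokens stacked at position 1.
        Map<String, List<Integer>> postings = index(new String[][] {
            {"blue"},
            {"car", "automobile", "wheeled", "thing"},
            {"road"}
        });
        System.out.println(matchesPhrase(postings, "blue", "thing")); // true  (the unwanted match)
        System.out.println(matchesPhrase(postings, "blue", "road"));  // false (road is two positions away)
    }
}
```

Because the model keeps only a position per token, nothing records that 'thing' originally followed 'wheeled' rather than 'blue', which is the information I'd need to preserve.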
Many thanks,
Phil