Multiple tokens with same position

I have a situation where multiple tokens end up at the same position (after a PatternCaptureTokenFilter).
Take a sentence like this, for example:
"blue car_automobile_wheeled_thing road"

If I split on underscore, I end up with a token list like this:

Token Position   Token String
0                blue
1                car
1                automobile
1                wheeled
1                thing
2                road

This is akin to the multi-word synonym problem as described in this brilliant blog post.

Note that 'wheeled thing' is effectively the multi-word synonym here.

Lucene, afaik, doesn't use the PositionLengthAttribute, so the bag of tokens at position 1 is unordered.
If I wanted to search for the phrase 'wheeled thing', I'd find it, but if I searched for 'blue thing', I'd find that too, erroneously, because 'thing' is one position away from 'blue'. Has anyone got a solution to this kind of multi-word synonym issue?
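
To make the false hit concrete, here is a minimal sketch in plain Python. It is a toy positional index, not Lucene code: `build_index` and `phrase_match` are hypothetical helpers, and the matching rule (term i of the phrase must sit at start + i, no slop) is an assumption about how an exact phrase query behaves.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each term to the set of positions it occupies (toy model)."""
    index = defaultdict(set)
    for position, term in tokens:
        index[term].add(position)
    return index

def phrase_match(index, phrase):
    """Exact phrase match: the i-th query term must occur at start + i."""
    terms = phrase.split()
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )

# Token list from the table above: four tokens stacked at position 1.
tokens = [
    (0, "blue"),
    (1, "car"), (1, "automobile"), (1, "wheeled"), (1, "thing"),
    (2, "road"),
]
index = build_index(tokens)

print(phrase_match(index, "blue car"))    # True, as intended
print(phrase_match(index, "blue thing"))  # True, the unwanted hit
print(phrase_match(index, "blue road"))   # False, road is two positions away
```

Every token stacked at position 1 looks adjacent to 'blue' in this model, which is why 'blue thing' comes back as a hit.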

Many thanks,
Phil

Assuming you had indexed "blue car road" with the synonyms car -> automobile and car -> wheeled thing, then in your example "thing" should be at position 2, not 1 (i.e., it overlaps "road", not "car"), given the way the synonym filter works today.

A "blue thing" phrase query then correctly does not match (good), but "wheeled thing road", for example, does not match even though it should (bad).

Essentially, the synonym filter cannot create new positions, so it takes multi-term synonyms and lays them on top of the existing tokens.
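
A rough sketch of that overlay behaviour, in plain Python rather than Lucene: `overlay_synonyms` and `phrase_match` are hypothetical helpers, and the rule that word i of a multi-term synonym lands at the original position + i is a simplified reading of the description above, not the filter's actual code.

```python
def overlay_synonyms(words, synonyms):
    """Return (position, term) pairs; synonym word i is stacked at pos + i."""
    tokens = []
    for pos, word in enumerate(words):
        tokens.append((pos, word))
        for synonym in synonyms.get(word, []):
            for offset, term in enumerate(synonym.split()):
                # No new positions are created: later synonym words
                # simply overlap whatever token already follows.
                tokens.append((pos + offset, term))
    return tokens

def phrase_match(tokens, phrase):
    """Exact phrase match: the i-th query term must occur at start + i."""
    positions = {}
    for pos, term in tokens:
        positions.setdefault(term, set()).add(pos)
    terms = phrase.split()
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )

tokens = overlay_synonyms(
    ["blue", "car", "road"],
    {"car": ["automobile", "wheeled thing"]},
)

print(sorted(tokens))
print(phrase_match(tokens, "blue thing"))          # False (good)
print(phrase_match(tokens, "wheeled thing"))       # True
print(phrase_match(tokens, "wheeled thing road"))  # False (bad, should match)
print(phrase_match(tokens, "car thing"))           # True in this toy model
```

In this model 'thing' shares position 2 with 'road', so 'road' can never follow 'thing', and 'car thing' matches as a side effect of the overlap.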

There is a working patch on https://issues.apache.org/jira/browse/LUCENE-6664 to let the synonym filter create new positions, so that it produces a correct graph, but it was controversial and got shelved.
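
For illustration only, this is roughly what a correct graph could look like if each token carried a start position and a position length. It is a conceptual sketch in plain Python, not the patch's implementation: the `(term, start, length)` tuples and `graph_phrase_match` are hypothetical, and the positions are my own assumption of how the example would be laid out once "wheeled thing" gets its own extra position.

```python
def graph_phrase_match(tokens, phrase):
    """Match a phrase by chaining tokens: each term must start where the
    previous one ended (end = start + length)."""
    terms = phrase.split()
    # Set of end positions reachable after matching the terms seen so far.
    frontier = {start + length for term, start, length in tokens
                if term == terms[0]}
    for wanted in terms[1:]:
        frontier = {start + length for term, start, length in tokens
                    if term == wanted and start in frontier}
    return bool(frontier)

# "car" and "automobile" span two positions so they line up with the
# two-word synonym "wheeled thing"; "road" moves to position 3.
graph = [
    ("blue", 0, 1),
    ("car", 1, 2),
    ("automobile", 1, 2),
    ("wheeled", 1, 1),
    ("thing", 2, 1),
    ("road", 3, 1),
]

print(graph_phrase_match(graph, "blue car road"))       # True
print(graph_phrase_match(graph, "wheeled thing road"))  # True
print(graph_phrase_match(graph, "blue thing"))          # False
print(graph_phrase_match(graph, "car thing"))           # False
```

With lengths taken into account, "wheeled thing road" matches and neither "blue thing" nor "car thing" does, which is the behaviour the flat overlay cannot give.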

Many thanks, Mike!
You're right of course, my example was wrong. A better example to explain our situation would be:

"blue big_wheeled_thing_CONCEPTCAR road", where we embed the concept CAR into the same token position as the words "big", "wheeled" and "thing". We can search with CONCEPTCAR fine. The issue we face is that we want to allow search hits for 'big wheeled thing', but not for 'big thing'. I think it's the same underlying issue as the one your blog post describes. Are there any other tricks we might use to get around this Lucene limitation and thereby get no hits for 'big thing'?