Multiple tokens with same position

philv · May 13, 2016, 2:49pm

I have a situation where multiple tokens end up at the same position (after a PatternCaptureGroupTokenFilter).
Suppose I had a sentence like this, for example:
"blue car_automobile_wheeled_thing road"

If I split on underscore, I'd end up with a token list looking like this:

Token position   Token string
0                blue
1                car
1                automobile
1                wheeled
1                thing
2                road

This is akin to the multi-word synonym problem as described in this brilliant blog post.

Note that 'wheeled thing' is effectively the multi-word synonym.

Lucene, afaik, doesn't use the PositionLengthAttribute, so the bag of tokens at position 1 is unordered.
If I wanted to search for the phrase 'wheeled thing', I'd find it, but if I searched for 'blue thing', I'd find that too, erroneously, because 'thing' is one position away from 'blue'. Has anyone got a solution to this kind of multi-word synonym issue?

Many thanks,
Phil
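
To make the false positive concrete, here is a minimal Python sketch of exact phrase matching over token positions. It is a toy model, not Lucene's implementation: the helper names index_positions and phrase_matches are made up for illustration, and it assumes an exact phrase (no slop) needs each successive query term exactly one position after the previous one.

```python
from collections import defaultdict

def index_positions(tokens):
    """Map each term to the set of positions it occupies."""
    postings = defaultdict(set)
    for position, term in tokens:
        postings[term].add(position)
    return postings

def phrase_matches(postings, phrase):
    """Exact phrase (assumed slop 0): term i must sit at start + i."""
    terms = phrase.split()
    return any(
        all(start + i in postings.get(term, ()) for i, term in enumerate(terms))
        for start in postings.get(terms[0], ())
    )

# Token list from the post: four tokens stacked at position 1.
tokens = [
    (0, "blue"),
    (1, "car"), (1, "automobile"), (1, "wheeled"), (1, "thing"),
    (2, "road"),
]
postings = index_positions(tokens)

print(phrase_matches(postings, "blue car"))    # True
print(phrase_matches(postings, "blue thing"))  # True (the unwanted hit)
print(phrase_matches(postings, "blue road"))   # False
```

Every token stacked at position 1 looks adjacent to 'blue' in this model, which is why 'blue thing' is indistinguishable from 'blue car'.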

mikemccand · May 13, 2016, 3:18pm

Assuming you had indexed "blue car road" with the synonyms car -> automobile and car -> wheeled thing, then in your example "thing" should be at position 2, not 1 (i.e., it overlaps "road", not "car"), given the way the synonym filter works today.

Then the phrase query "blue thing" should not match (good), but e.g. "wheeled thing road" won't match even though it should (bad).

Essentially, the synonym filter cannot create new positions, so it takes multi-term synonyms and lays them on top of the existing tokens.
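
A rough Python sketch of that overlay behaviour, as a toy model only: overlay_synonyms and phrase_matches are hypothetical helpers, not Lucene APIs, and the matcher assumes an exact phrase needs consecutive positions.

```python
from collections import defaultdict

def overlay_synonyms(words, synonyms):
    """Lay each synonym's terms on top of the existing positions.

    No new positions are created: term k of a multi-term synonym for
    the word at position p is simply placed at position p + k.
    """
    postings = defaultdict(set)
    for position, word in enumerate(words):
        postings[word].add(position)
        for synonym in synonyms.get(word, []):
            for offset, term in enumerate(synonym.split()):
                postings[term].add(position + offset)
    return postings

def phrase_matches(postings, phrase):
    """Exact phrase (assumed slop 0): term i must sit at start + i."""
    terms = phrase.split()
    return any(
        all(start + i in postings.get(term, ()) for i, term in enumerate(terms))
        for start in postings.get(terms[0], ())
    )

postings = overlay_synonyms(
    ["blue", "car", "road"],
    {"car": ["automobile", "wheeled thing"]},
)

print(sorted(postings["thing"]))                       # [2], on top of "road"
print(phrase_matches(postings, "blue thing"))          # False (good)
print(phrase_matches(postings, "wheeled thing"))       # True
print(phrase_matches(postings, "wheeled thing road"))  # False (bad)
```

Because "thing" shares position 2 with "road", nothing can follow "thing" except whatever sits at position 3, so the three-term phrase has nowhere to land.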

There is a working patch on https://issues.apache.org/jira/browse/LUCENE-6664 to let the synonym filter create new positions, so that it produces a correct graph, but it was controversial and got shelved.
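
For intuition about what a "correct graph" buys you, here is a small Python sketch that treats each token as an edge from its start position to start + position length and matches a phrase by walking connected edges. This is an illustrative assumption, not the code from the LUCENE-6664 patch: the edge list and the name graph_phrase_matches are made up for the example.

```python
def graph_phrase_matches(edges, phrase):
    """Return True if the phrase's terms form a connected path.

    edges: list of (term, start, position_length) tuples.
    """
    terms = phrase.split()
    # Nodes reached after consuming the first term, from any start.
    reached = {start + length for term, start, length in edges if term == terms[0]}
    for wanted in terms[1:]:
        # Only follow edges that begin where a previous edge ended.
        reached = {
            start + length
            for term, start, length in edges
            if term == wanted and start in reached
        }
    return bool(reached)

# Hypothetical graph for "blue car road" with car -> automobile and
# car -> wheeled thing. A new position (node 2) is created so that
# "wheeled thing" spans the same two positions as "car".
edges = [
    ("blue", 0, 1),
    ("car", 1, 2),
    ("automobile", 1, 2),
    ("wheeled", 1, 1),
    ("thing", 2, 1),
    ("road", 3, 1),
]

print(graph_phrase_matches(edges, "wheeled thing road"))  # True
print(graph_phrase_matches(edges, "car road"))            # True
print(graph_phrase_matches(edges, "blue thing"))          # False
print(graph_phrase_matches(edges, "wheeled road"))        # False
```

In this model "car" and "automobile" carry a position length of 2, so they reach "road" directly, while "thing" is only reachable through "wheeled".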

philv · May 13, 2016, 3:58pm

Many thanks, Mike!
You're right of course, my example was wrong. A better example to explain our situation would be:

"blue big_wheeled_thing_CONCEPTCAR road", where we embed the concept CAR into the same token position as the words "big", "wheeled" and "thing". We can search with CONCEPTCAR fine. The issue we face is that we want to allow search hits for 'big wheeled thing', but not hits for 'big thing'. I think it's the same underlying issue as the one your blog post mentioned. Are there any other tricks we might use to get around this Lucene limitation and thereby get no hits for 'big thing'?
