Synonym expansion only during search time?


(Lukáš Vlček) #1

Hi,

I would like to make a synonym for "ws". It should translate to "web
service". After some experiments I found that a synonym filter is not a
good solution for this. If I use a rule like "ws => web service" or
"ws, web service", then a document containing the term "ws" gets two other
terms, "web" and "service", injected during indexing, which means that it
becomes relevant for the search "web" (note the omitted "service" part of
the query). The other problem is that with such a synonym rule in place, the
query "ws" is expanded during search to the two terms "web" and "service",
which means that documents containing only "web" can match even though they
do NOT contain the term "service".
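
To make the two false hits concrete, here is a toy simulation in plain Python. It is not Lucene or ES code: the `SYNONYMS` table, the whitespace tokenizer, and the OR-style matching are all assumptions made for illustration, and the original token is kept alongside its expansion.

```python
from collections import defaultdict

# Hypothetical synonym rule standing in for "ws, web service".
SYNONYMS = {"ws": ["web", "service"]}

def analyze(text):
    """Lowercase, split on whitespace, and append the expansion of any synonym."""
    tokens = []
    for token in text.lower().split():
        tokens.append(token)
        tokens.extend(SYNONYMS.get(token, []))
    return tokens

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search_any(index, query):
    """Return ids of documents matching ANY analyzed query token (OR semantics)."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

docs = {1: "our ws is down", 2: "web design tips"}
index = build_index(docs)

print(search_any(index, "web"))  # doc 1 matches only because "web" was injected
print(search_any(index, "ws"))   # doc 2 matches only because the query grew "web"
```

Both searches return documents 1 and 2 in this toy setup, which is exactly the pair of false hits described above.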

I am not sure if there is another solution, but it seems to me that what I am
looking for in this case is functionality that would expand the Lucene query
by adding a proximity part like "web service"~1 (or something like that). Is
something like that possible in ES? Or are there better ways to handle the
"ws" - "web service" synonym and avoid false search hits?

Regards,
Lukas


(Jan Fiedler) #2

I guess you are opening an interesting can of worms there. I have
worked on similar issues on pure Lucene (2.x versions back then) and I
am not aware of any built-in Lucene solutions (this may have changed
though).

Back then I eventually ended up writing my own index-time and query-time
analyzers to handle this. I think the feature you are looking for is
called 'common phrases'. For example, 'baby doll' (a special kind of
pyjamas) is a classic common phrase that has a completely different
meaning than the individual terms 'baby' and 'doll'. Usually you do not
want 'baby doll' to match if someone searches for 'doll'.

To support something like this you basically have to detect certain
sequences of tokens at indexing time and modify them so that they
are no longer matched against the individual terms (e.g. one can
simply concatenate them with a separator). Then at query time, you
apply the same analysis process, so that someone searching for
'baby doll' gets the correctly encoded keyword (e.g. 'baby-doll')
while someone searching for 'doll for a baby' gets the plain
tokens ('doll' and 'baby', assuming stopword removal).
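
A minimal sketch of that idea in plain Python, not tied to any Lucene API. The phrase list, the stopword list, and the restriction to two-word phrases are assumptions made for illustration; the same `analyze` function would be used at both index time and query time.

```python
# Hypothetical phrase and stopword lists, for illustration only.
COMMON_PHRASES = {("baby", "doll")}
STOPWORDS = {"for", "a", "the"}

def analyze(text, separator="-"):
    """Merge known two-word phrases into one token, then drop stopwords."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in COMMON_PHRASES:
            # Adjacent words form a common phrase: emit a single joined token.
            tokens.append(separator.join(pair))
            i += 2
        else:
            if words[i] not in STOPWORDS:
                tokens.append(words[i])
            i += 1
    return tokens

print(analyze("baby doll"))
print(analyze("doll for a baby"))
```

Phrases are detected before stopwords are removed, so 'doll for a baby' is not accidentally collapsed into the phrase token.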

In your case, you actually want synonyms + common phrases (i.e. expand
'ws' into 'web service' and handle 'web service' as a common phrase).
It would be interesting to learn whether there are any open-source
common phrase solutions out there ...


(ppearcy) #3

Hey Lukas,
Jumping on this a little late, but I was just confronted with something
similar. For your "web service" example, I think things would work as
expected with a synonym file that looks like this:
ws, web service

Make sure to keep the single-token synonym at the front, as that is
the one that will go into the index.

Then use an analyzer whose synonym filter has expand set to false,
something like this:
analysis :
    filter :
        english_snowball :
            type : snowball
            language : English
        synonym :
            type : synonym
            synonyms_path : analysis/synonym.txt
            ignore_case : true
            expand : false
    analyzer :
        test_analyzer :
            type : custom
            filter : [standard, lowercase, synonym, stop, english_snowball]
            tokenizer : standard
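
To illustrate what expand set to false is meant to do here, the following plain-Python sketch contracts every synonym on a line to the first entry of that line. It is a simplified model and not the actual filter implementation; `parse_rules` and `contract` are hypothetical helper names.

```python
def parse_rules(lines):
    """Map each synonym (as a tuple of words) to the first entry of its line."""
    mapping = {}
    for line in lines:
        entries = [tuple(e.split()) for e in line.lower().split(",") if e.strip()]
        for entry in entries:
            mapping[entry] = entries[0]
    return mapping

def contract(tokens, mapping):
    """Replace the longest matching synonym at each position with its first-entry form."""
    max_len = max((len(key) for key in mapping), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "web service" wins over "web".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in mapping:
                out.extend(mapping[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

rules = parse_rules(["ws, web service"])
print(contract(["a", "web", "service"], rules))
print(contract(["web", "design"], rules))
```

In this model "web service" and "ws" both end up as the single token "ws", while a lone "web" is left untouched, so a search for "web" no longer hits documents that only contained "ws".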

When a user enters a query, I perform both a phrase search and a
normal term search, e.g.:
"web service" OR (web service)

which should correctly match anything with ws. Where things got a
little tricky for me was when I had multiple synonym phrases with no
single-token form. I ended up using a UUID as the single token. So,
if I had a list like this:
nicotine addiction, smoking cessation

I would transform it into something like this:
7b4d385ab81911e0bc60001e0beaa9c4, nicotine addiction, smoking cessation
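
The transformation can be scripted. Below is a rough sketch, assuming one comma-separated synonym group per line; `add_canonical_token` is a hypothetical helper, and using `uuid.uuid4().hex` is just one way to get a 32-character token like the one above.

```python
import uuid

def add_canonical_token(line):
    """Ensure a synonym line starts with a single-token entry."""
    entries = [e.strip() for e in line.split(",") if e.strip()]
    singles = [e for e in entries if len(e.split()) == 1]
    if singles:
        # A single-token synonym already exists: move the first one to the front.
        first = singles[0]
        rest = [e for e in entries if e != first]
        return ", ".join([first] + rest)
    # No single-token form: invent one from a random UUID (32 hex characters).
    return ", ".join([uuid.uuid4().hex] + entries)

print(add_canonical_token("web service, ws"))
print(add_canonical_token("nicotine addiction, smoking cessation"))
```

The generated token is random, so the synonym file should be generated once and then kept stable; regenerating it would change the token that was already written into the index.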

Would be curious to know if this approach works for your use case and
if you see any holes with this technique. My testing so far looks
good.

Thanks!
Paul

