When I search for 'sutent' and look at the _explain results, I see that
each hit contributes to the score, even though the source document only has
one mention of 'sutent' and none of its synonyms. The net result is that
words with more synonyms artificially get a boost in the results.
0.8291347 = weight(brief_title:sunitinib in 27490), product of:
0.8291347 = weight(brief_title:sunitinib in 27490), product of:
0.5862867 = weight(brief_title:su11248 in 27490), product of:
0.5862867 = weight(brief_title:su011248 in 27490), product of:
0.5862867 = weight(brief_title:sutent in 27490), product of:
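To make the inflation concrete, here is a rough sketch that adds up the per-term weights listed above. It assumes the final score is simply the sum of the matching clauses and ignores any other factors in the full _explain tree (which is not shown here), so treat the totals as illustrative only.

```python
# Per-term weights copied from the _explain lines above.
# Assumption: the overall score is roughly the sum of these clauses.
weights = [
    ("sunitinib", 0.8291347),
    ("sunitinib", 0.8291347),
    ("su11248", 0.5862867),
    ("su011248", 0.5862867),
    ("sutent", 0.5862867),
]

def summed_score(entries):
    """Add up the weight of every clause in the list."""
    return sum(weight for _, weight in entries)

# Score if only the term that is actually in the document counted.
single = summed_score([e for e in weights if e[0] == "sutent"])
# Score when every synonym clause contributes.
expanded = summed_score(weights)

print(f"sutent only: {single:.7f}, all synonyms: {expanded:.7f}")
```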
Trying expand=true and expand=false in the mapping makes no difference. Is
there a setting I can change to avoid this behaviour?
I'll put together a gist if the solution is not immediately obvious to
someone.
Thanks in advance,
Kevin
BONUS QUESTION: is there an explanation somewhere of when I should use
expand=false? I read the explanation in the docs but I'm still not getting
it.
There are various ways to approach this problem. Either you:
* expand your synonym list at index time (i.e. you store all
  variations of the synonym in your index), but then search on
  just one variation (by using a different analyzer at search time
  than at index time), or
* contract your synonym list at both index and search time: e.g. foo,
  bar or baz all get indexed as just 'foo', and a search for 'bar'
  becomes a search for 'foo'.
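The two approaches can be sketched in plain Python without a running cluster. This is only a toy model of a synonym filter: the names `SYNONYMS`, `build_maps`, `analyze` and `matching_terms` are made up for illustration and are not Elasticsearch APIs. It counts how many distinct terms a query matches, which is what drives the inflated score when synonyms are expanded on both sides.

```python
# Toy model of a synonym filter; all names here are hypothetical.
SYNONYMS = [["foo", "bar", "baz"]]

def build_maps(groups):
    """Build an expanding map (term -> all variants) and a
    contracting map (term -> first variant only)."""
    expand, contract = {}, {}
    for group in groups:
        for term in group:
            expand[term] = list(group)
            contract[term] = [group[0]]
    return expand, contract

def analyze(text, mapping=None):
    """Split on whitespace and apply the synonym map, if any."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(mapping.get(word, [word]) if mapping else [word])
    return tokens

def matching_terms(index_tokens, query_tokens):
    """Distinct query terms that are present in the indexed tokens."""
    return sorted(set(index_tokens) & set(query_tokens))

EXPAND, CONTRACT = build_maps(SYNONYMS)
doc = "bar stool"

# Expanded on both sides: every synonym matches (the inflated case).
both_expanded = matching_terms(analyze(doc, EXPAND), analyze("baz", EXPAND))
# Approach 1: expand at index time, no synonyms at search time.
index_expanded = matching_terms(analyze(doc, EXPAND), analyze("baz"))
# Approach 2: contract at both index and search time.
contracted = matching_terms(analyze(doc, CONTRACT), analyze("baz", CONTRACT))

print(both_expanded, index_expanded, contracted)
```

In this model only the first configuration matches more than one term per synonym group; both suggested approaches reduce the query to a single matching term.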
I have put together a gist demonstrating how this all works:
https://gist.github.com/4095280
The question remains: which should I prefer? expand: true or false?
I'm open to disagreement, but my vote would be for expand: false, i.e.
index just the first word in the synonym list, not all the words.
My reasons for that are:
* there are fewer terms to index
* replacing synonyms with all variations or with just one variation
  implies the same loss of original information (i.e. which synonym
  appeared in the original text)
* synonyms can be of different lengths (e.g. "wi fi" vs "wifi"), which
  means that (with expand: true) the phrase "wifi router" would be
  indexed as:

  Pos:   1      2        3
         wifi   router
         wi     fi       router

  which can mess up e.g. phrase queries, which depend on token
  positions, and can also mess up snippet highlighting.
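The position clash can be reproduced with a few lines of Python. This sketch only mirrors the layout of the diagram above; it is not how Lucene's synonym filter is implemented, and `variant_rows` is a made-up helper name.

```python
def variant_rows(tokens, synonyms):
    """Return one row per synonym variant, each mapping position -> term.

    Multi-word variants such as "wi fi" occupy one position per word,
    so later tokens are pushed to a higher position in that row.
    """
    rows = [[]]
    for token in tokens:
        variants = synonyms.get(token, [token])
        rows = [row + variant.split() for row in rows for variant in variants]
    return [dict(enumerate(row, start=1)) for row in rows]

rows = variant_rows(["wifi", "router"], {"wifi": ["wifi", "wi fi"]})

# Positions at which "router" would have to sit to follow each variant.
router_positions = sorted(
    pos for row in rows for pos, term in row.items() if term == "router"
)

for row in rows:
    print(row)
print("router positions:", router_positions)
```

Because "router" would need to sit at two different positions to follow both variants, a phrase query that relies on adjacent positions cannot line up with both at once.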
Thank you Clint, that helps my understanding a lot. I will try expand=false.
And thanks for the gist, too.
Kevin
On Saturday, November 17, 2012 4:14:50 AM UTC-8, Clinton Gormley wrote:
There are various ways to approach this problem. Either you:
* expand your synonym list at index time (ie you store all
variations of the synonym in your index), but then you search on
just one variation (by using a different analyzer at search or
index time),
* contract your synonym list at index and search time: eg foo, bar
or baz all get indexed as just 'foo'. A search for 'bar'
becomes a search for 'foo'
On Mon, 2012-11-19 at 16:54 -0800, Kevin Lawrence wrote:
A quick note to confirm that this worked for me and to thank you for
an amazingly detailed answer. I learned a ton of stuff from reading
your gist.
Glad it helped.
Thank you, Clint!
Kevin
On Saturday, November 17, 2012 4:14:50 AM UTC-8, Clinton Gormley wrote:
* contract your synonym list at index and search time: eg foo, bar
or baz all get indexed as just 'foo'. A search for 'bar'
becomes a search for 'foo'
I have put together a gist demonstrating how this all works:
https://gist.github.com/4095280