Often people using our search type queries that start with "how to", e.g.
"how to paint my kitchen". This can return results for "tips to paint my
kitchen" or "how to paint my bathroom". The phrase "how to" is generic and
I would like to minimize its significance. I don't want to remove it
completely, because I would still like a post called "how to paint my
kitchen cabinets" to rank higher than "should I wallpaper or paint my
kitchen".
I don't want it to be a stopword because it still has value (as in the
example).
The Common Terms query might work, but I don't necessarily want to apply
its rules to all other common phrases. (It might be a good idea, but this
is a specific common search term that I know people search for, and I would
like to solve it specifically for this case if possible.)
I don't think a negative boost is what I want, because I don't want those
documents to be penalized for containing the words "how to"; they should
just get a much smaller boost.
Any suggestions on how to approach this? For the record, I'm using the BM25
similarity algorithm.
You cannot penalize terms; you can only reward them. The trick is to
reward the important terms, so that all other (unwanted and unknown) terms
are effectively penalized. One method is to analyze sentences for grammar
(part-of-speech tagging), reward nouns or other keywords with boost values,
and use an extended similarity algorithm.
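A minimal sketch of the "reward the important terms" idea, not the payload-based approach itself. It assumes a small hand-maintained set of generic words standing in for a real part-of-speech tagger, a hypothetical `title` field, and made-up boost values. Each query word becomes its own `match` clause in a `bool` query, with the keywords boosted well above the generic words.

```python
# Assumption: a hand-picked word list replaces real POS tagging here.
GENERIC_TERMS = {"how", "to"}


def weighted_term_query(text, field="title", keyword_boost=3.0, generic_boost=0.3):
    """Build a bool/should query with one boosted match clause per word.

    The field name and both boost values are illustrative assumptions.
    """
    should = []
    for word in text.lower().split():
        # Generic words still contribute to the score, just far less.
        boost = generic_boost if word in GENERIC_TERMS else keyword_boost
        should.append({"match": {field: {"query": word, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}


query = weighted_term_query("how to paint my kitchen")
```

The resulting dictionary can be sent as the body of a search request. Nothing is subtracted from any document's score: a title containing "how to" still earns a little extra, which matches the goal of a smaller reward rather than a penalty.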
You can use UIMA, OpenNLP, or Stanford NLP for POS tagging, and try to
implement payload-based scoring, something like this demo code.
My demo code does not work; I'm not sure where I made a mistake.
Jörg
(In reply to Yehosef Shapiro's original post of Sun, Apr 12, 2015 at 12:34 PM.)
Yehosef, this sounds very similar to some title search work I've done.
Title fields are odd because TF is often meaningless, and IDF can also
be quite skewed. If only a few titles have "how" in the text, then you'll
get very odd results.
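To illustrate the IDF skew, here is a small sketch using the IDF formula from Lucene's BM25 similarity. The document counts are invented for illustration and do not come from any real index.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers: 100,000 titles, with "how" appearing in only a few.
TITLE_COUNT = 100_000
HYPOTHETICAL_DOC_FREQS = {"how": 50, "paint": 2_000, "kitchen": 5_000}

idf_by_term = {
    term: bm25_idf(TITLE_COUNT, doc_freq)
    for term, doc_freq in HYPOTHETICAL_DOC_FREQS.items()
}

for term, idf in idf_by_term.items():
    print(f"{term}: idf = {idf:.2f}")
```

With counts like these, the rare term "how" gets the largest IDF of the three, so a match on "how" can outweigh a match on "paint" or "kitchen" even though it says less about the topic.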
--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
Thanks for this. So I could basically strip out the unwanted terms, then
do the search with two clauses: one with the original search phrase at a
lower weight, and another with the "cleaned" search phrase at a higher
weight.
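A rough sketch of that two-clause query, done at query time. The `title` field, the phrase list, and the boost values are all assumptions for illustration; the function only builds the request body and does not talk to a cluster.

```python
import re

# Assumption: the generic phrases to strip are maintained by hand.
GENERIC_PHRASES = ["how to"]


def strip_generic(text):
    """Remove the generic phrases (whole words only) and tidy whitespace."""
    cleaned = text
    for phrase in GENERIC_PHRASES:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    return " ".join(cleaned.split())


def two_clause_query(text, field="title", original_boost=0.5, cleaned_boost=2.0):
    """Original phrase at a low boost plus the cleaned phrase at a high boost."""
    should = [{"match": {field: {"query": text, "boost": original_boost}}}]
    cleaned = strip_generic(text)
    # Add the second clause only if stripping changed something and left text.
    if cleaned and cleaned != " ".join(text.split()):
        should.append({"match": {field: {"query": cleaned, "boost": cleaned_boost}}})
    return {"query": {"bool": {"should": should}}}


body = two_clause_query("how to paint my kitchen")
```

Because both clauses sit under `should`, a title such as "how to paint my kitchen cabinets" matches both and picks up the small extra score from the first clause, while "tips to paint my kitchen" still scores well on the cleaned clause.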
(In reply to Jörg Prante's message of Monday, April 13, 2015 at 12:05:44 AM UTC+3.)
Thanks for the link, good info. I'm leaning toward something like what you
recommend with your keepWordFilter, but doing it at query time instead of
index time. It doesn't seem like I need to use the memory to store both
"Socrates and Plato on Metaphysics" and "Socrates Plato Metaphysics". It
seems better to make the distinction at query time, and the performance
should be the same because I need two search clauses anyway.
(In reply to Doug Turnbull's message of Monday, April 13, 2015 at 12:15:14 AM UTC+3.)