How to lower the significance of a certain phrase

Often people using our search type a "how to" query, e.g. "how to paint
my kitchen". This might return results for "tips to paint my kitchen"
or "how to paint my bathroom". The phrase "how to" is generic, and
I would like to minimize its significance. I don't want to remove it
completely, because I would still like a post called "how to paint my
kitchen cabinets" to rank higher than "should I wallpaper or paint my
kitchen".

I don't want it to be a stopword because it still has value (as in the
example).

The Common Terms query might work - but I don't necessarily want to apply
the rules to all other common phrases (it might be a good idea - but this
is a specific common search term that I know people search for and I would
like to solve it specifically for this case if possible.)

I don't think the negative boost is what I want because I don't want those
documents to get penalized for containing the words "how to" - just that
they should get a much smaller boost.

Any suggestions on how to approach this? For the record, I'm using the BM25
similarity algorithm.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/acd86fb2-ae69-40be-a772-c65d008f2415%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You cannot penalize terms; you can only reward terms. The trick is to
reward the important terms, so that all other (unwanted and unknown) terms
are penalized by comparison. One method is to analyze sentences for grammar
(part-of-speech tagging), reward nouns or other keywords with boost values,
and use an extended similarity algorithm.
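
As a rough illustration of "reward the important terms", here is a minimal
sketch that gives each query term its own boost at query time. It is not the
payload-based scoring described in this message: the weight table is a
hand-written, hypothetical stand-in for what a POS tagger might produce, and
the field name "title" and the helper name are assumptions as well.

```python
# Hypothetical per-term weights. In practice these might be derived from
# POS tags (e.g. nouns get the default weight, function words a low one).
DEFAULT_WEIGHT = 1.0
LOW_WEIGHTS = {"how": 0.1, "to": 0.1}


def weighted_term_query(user_query, field="title"):
    """Build a bool/should query with one boosted match clause per term."""
    clauses = []
    for term in user_query.lower().split():
        boost = LOW_WEIGHTS.get(term, DEFAULT_WEIGHT)
        clauses.append({"match": {field: {"query": term, "boost": boost}}})
    return {"query": {"bool": {"should": clauses}}}


body = weighted_term_query("how to paint my kitchen")
```

Documents containing "how" and "to" still gain a little score, so nothing is
penalized outright; the generic terms simply contribute much less than
"paint" or "kitchen".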

You can use UIMA, OpenNLP, or Stanford NLP for POS tagging, and try to
implement payload-based scoring, something like this demo code:

GitHub - jprante/elasticsearch-payload: Term payloads for Elasticsearch

My demo code does not work, though; I'm not sure where I made a mistake.

Jörg

Yehosef, this sounds very similar to some title search work I've done.
Title fields are odd because TF is often meaningless, and IDF can also
be quite skewed. If only a few titles have "how" in the text, then you'll
get very odd results.

Read more here:

Title Search: when relevancy is only skin deep - OpenSource Connections
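
To make the IDF skew concrete, here is a small sketch using the IDF formula
from Lucene's BM25 similarity. The document counts are made up purely for
illustration; they are not measurements from any real index.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index: 100,000 titles, "how" in only 50, "paint" in 5,000.
TOTAL_TITLES = 100_000
idf_how = bm25_idf(TOTAL_TITLES, 50)
idf_paint = bm25_idf(TOTAL_TITLES, 5_000)

print(f"idf(how)   = {idf_how:.2f}")
print(f"idf(paint) = {idf_paint:.2f}")
```

With these assumed counts the rare word "how" gets a much larger IDF than
"paint", so a generic word can dominate the score simply because few titles
happen to contain it.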

--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Thanks for this - so I could basically strip out the unwanted terms.
Then I could do the search with two clauses: one with the original search
phrase at a lower weight, and another with the "cleaned" search phrase
at a higher weight.
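
The two-clause idea could be sketched roughly as follows. This is only an
illustration: the phrase list, the boost values, the field name "title", and
the helper names are all assumptions that would need tuning against real
data.

```python
import re

# Hypothetical list of generic lead-in phrases to down-weight.
GENERIC_PHRASES = ["how to"]


def strip_generic(query):
    """Remove the generic phrases and collapse leftover whitespace."""
    cleaned = query
    for phrase in GENERIC_PHRASES:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    return " ".join(cleaned.split())


def build_query(user_query, field="title"):
    """Original phrase at a low boost, cleaned phrase at a higher boost."""
    should = [{"match": {field: {"query": user_query, "boost": 0.2}}}]
    cleaned = strip_generic(user_query)
    if cleaned:  # skip the second clause if nothing is left after cleaning
        should.append({"match": {field: {"query": cleaned, "boost": 1.0}}})
    return {"query": {"bool": {"should": should}}}


body = build_query("how to paint my kitchen")
```

Because both clauses sit in a bool "should", a title containing "how to"
still scores a little higher than one without it, but most of the score
comes from the cleaned phrase.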

So because we're using BM25, I think this is a lower concern in general
(there is a good chart in "Pluggable Similarity Algorithms" in
Elasticsearch: The Definitive Guide).

We also disable norms on title fields (see the Stack Overflow question
"Elasticsearch : when to set omit_norms option as false"), FWIW.
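
For reference, a title field with norms disabled and BM25 similarity might be
mapped roughly like this. The sketch assumes a recent Elasticsearch version,
where a "text" field accepts "norms": false; the 1.x releases current at the
time of this thread used a different syntax, so check the mapping
documentation for your version. The field name "title" is an assumption.

```python
import json

# Index-creation body: a text field without length norms, scored with BM25.
title_mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "norms": False,        # no field-length normalization
                "similarity": "BM25",  # built-in BM25 similarity
            }
        }
    }
}

# Serialized form, as it would be sent in the request body.
mapping_json = json.dumps(title_mapping, indent=2)
print(mapping_json)
```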

Thanks for the link - good info. I'm leaning toward something like what you
recommend with your keepWordFilter, but doing it at query time instead of
index time. It doesn't seem like I need to use the memory to store both
"Socrates and Plato on Metaphysics" and "Socrates Plato Metaphysics". It
seems better to make the distinction at query time, and the performance
should be the same because I need two search clauses anyway.
