Is there a way to give the last document a higher score since the terms
"john" "smith" have two matches on the same field?
Note that this behavior is a little different from using match_phrase
with slop, because the query can still match terms in any of the fields
but should score higher when more of the matches fall on the same field.
First, note that in Lucene's default similarity there already are two
biases towards matches with fewer fields. Try to take advantage of those
before going on a boosting expedition.
Each term tends to get converted into a boolean SHOULD clause. Every
SHOULD clause match gets added to the score. So the fewer matches, the
lower the score.
For an even stronger bias, Lucene applies coord, the coordination
factor. If only 1 out of 3 search terms matches the field being searched,
the score is multiplied by 1/3, which punishes it. So documents where more
terms match should have a much higher chance of winning.
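
To make the two biases concrete, here is a toy sketch in Python. It is not
Lucene's actual scoring: it assumes every term clause contributes the same
flat weight and ignores TF-IDF and field norms. The function name, the third
query term "sydney" and the sample field contents are made up for
illustration.

```python
def toy_score(query_terms, field_tokens, term_weight=1.0):
    """Toy model: sum the matching SHOULD clauses, then scale by coord."""
    tokens = set(field_tokens)
    matched = [t for t in query_terms if t in tokens]
    if not matched:
        return 0.0
    # Bias 1: every matching SHOULD clause adds to the score.
    clause_sum = term_weight * len(matched)
    # Bias 2: coord = matching terms / total query terms.
    coord = len(matched) / len(query_terms)
    return coord * clause_sum


query = ["john", "smith", "sydney"]  # hypothetical three-term query
one_match = toy_score(query, ["john", "doe"])
two_matches = toy_score(query, ["john", "smith"])
three_matches = toy_score(query, ["john", "smith", "sydney"])
print(one_match, two_matches, three_matches)
```

In this toy model the score grows faster than the number of matching terms,
because the clause sum and coord both increase with each extra match.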
Huh, you're thinking, why doesn't my scenario just work then? What you're
doing is a cross_fields search. Cross-field search is something new in
Elasticsearch whereby both fields are blended together and treated like a
single field. So the biasing above applies to the two fields together. If
you want to know more about cross-field search, here's an article I
recently wrote: "Elasticsearch Cross Field Search Is A Lie" (OpenSource
Connections).
If you want to actually have a bias towards a field with more matches, I'd
recommend best_fields or most_fields search. They take both search terms
to each field first, running a separate search in each field. The
per-field scores are then combined (by taking the max score for
best_fields, or by adding them for most_fields).
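
As a rough illustration of the suggestion above, the sketch below builds a
multi_match request body as a Python dict and mimics how the per-field
scores get combined. The field names "name" and "address", the helper names
and the sample scores are assumptions for illustration, and the combine
step is simplified (it ignores tie_breaker and coord).

```python
def multi_match_query(text, fields, match_type):
    """Build a multi_match request body for the given type."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": match_type,  # "best_fields" or "most_fields"
                "fields": list(fields),
            }
        }
    }


def combine(per_field_scores, match_type):
    """Simplified view of how the per-field scores are merged."""
    if match_type == "best_fields":
        return max(per_field_scores.values())  # best single field wins
    if match_type == "most_fields":
        return sum(per_field_scores.values())  # every field adds up
    raise ValueError("unsupported type: " + match_type)


body = multi_match_query("john smith", ["name", "address"], "best_fields")
scores = {"name": 1.5, "address": 0.25}  # made-up per-field scores
print(body)
print(combine(scores, "best_fields"), combine(scores, "most_fields"))
```

Because each field is searched with the full query before combining, a
field that contains both terms scores well on its own instead of being
blended with the other field.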
--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.
Until I finish the related chapter in the search relevance book I'm
writing (shameless plug :-p Relevant Search), the best places
to read about these topics are the docs or the online guide. In particular,
this appears relevant
Thank you for your quick response and comprehensive explanation. It does
make sense.
We are using cross_fields (with the "and" operator) because we want to make
sure that the documents returned contain all the search terms somewhere.
For example, the search for "100 john smith" would return only one
document ("john smith" matches the name and "100" matches the address).
We expect no results for "200 john smith", as 200 appears nowhere.
But if we search for "john smith" we should get both documents back, and
the document with "john smith" should be the first one in the list (since
the terms "john" and "smith" match on the same field).
Is it possible to accomplish this with best_fields or most_fields?
Thanks again,
Andre
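
One possible way to express the requirement above, sketched as a Python
dict for the request body: keep the cross_fields query with the "and"
operator as the required clause, and add an optional best_fields clause
that only matches when a single field contains all the terms, so those
documents get extra score. This is an untested sketch, not a confirmed
answer from the thread. The field names "name" and "address", the helper
name and the boost value are assumptions.

```python
def all_terms_with_same_field_boost(text, fields, boost=2.0):
    """Require all terms somewhere; reward all terms in one field."""
    field_list = list(fields)
    return {
        "query": {
            "bool": {
                # Every term must appear in at least one of the fields.
                "must": [{
                    "multi_match": {
                        "query": text,
                        "type": "cross_fields",
                        "operator": "and",
                        "fields": field_list,
                    }
                }],
                # Optional clause: with best_fields the "and" operator is
                # applied per field, so it only matches documents where one
                # field holds all the terms.
                "should": [{
                    "multi_match": {
                        "query": text,
                        "type": "best_fields",
                        "operator": "and",
                        "fields": field_list,
                        "boost": boost,  # hypothetical value, tune as needed
                    }
                }],
            }
        }
    }


request_body = all_terms_with_same_field_boost("john smith", ["name", "address"])
print(request_body)
```

The must clause alone decides which documents come back, so "200 john
smith" would still return nothing. The should clause only affects ordering.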
On Tuesday, April 14, 2015 at 12:21:33 PM UTC+10, Doug Turnbull wrote:
Sorry for the confusing typo: in "towards matches with fewer fields",
"fields" should be "search terms".