Scoring based on the number of matches in the field

Hi there,

I have the following query:

{
  "query": {
    "multi_match": {
      "operator": "and",
      "type": "cross_fields",
      "query": "john smith",
      "fields": ["name", "address"]
    }
  }
}

That will match these documents:

Name: James Smith
Address: 325 John Street

Name: John Smith Junior
Address: 100 Baryl Street

Is there a way to give the last document a higher score since the terms
"john" "smith" have two matches on the same field?

Note that this behavior is a little different from using match_phrase
with slop: the query can still match terms in any of the fields, but it
should score higher when more of the terms match in the same field.

Thanks,

Andre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cc76f51b-3721-4978-a3ed-e59ff4c8f138%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

First, note that in Lucene's default similarity there already are two
biases towards matches with fewer fields. Try to take advantage of those
before going on a boosting expedition.

  1. Each term tends to get converted into a boolean SHOULD clause. Every
    SHOULD clause match gets added to the score. So the fewer matches, the
    lower the score.

  2. For an even stronger bias, Lucene adds coord or the coordinating
    factor. If only 1 out of 3 search terms match the field being searched, a
    multiple of 1/3 is applied thus punishing the score. So matches where more
    terms match should have a much higher chance of winning.
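
To make the arithmetic in the two points above concrete, here is a toy sketch in Python. It is not Lucene's actual scoring code: it ignores tf, idf, field norms and query normalization, and the per-term score of 1.0 and the function names are made up for illustration. It only shows how summing the matching SHOULD clauses and then multiplying by a coord factor (matched clauses divided by total clauses) punishes partial matches.

```python
from typing import Sequence


def coord(matched_clauses: int, total_clauses: int) -> float:
    """Fraction of the optional query clauses that a document matched."""
    if total_clauses <= 0:
        raise ValueError("total_clauses must be positive")
    return matched_clauses / total_clauses


def toy_score(term_scores: Sequence[float], total_clauses: int) -> float:
    """Sum the scores of the clauses that matched, then scale by coord."""
    return coord(len(term_scores), total_clauses) * sum(term_scores)


# Three search terms, each hypothetically worth 1.0 when it matches.
one_of_three = toy_score([1.0], total_clauses=3)
all_three = toy_score([1.0, 1.0, 1.0], total_clauses=3)

print(one_of_three, all_three)
```

In this toy model the document matching all three terms ends up with nine times the score of the document matching one term, because it both sums more clauses and avoids the 1/3 penalty.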

If you want to know more, read Lucene's javadocs on similarity:
https://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

You may be thinking: why doesn't my scenario just work, then? What you're
doing is a cross_fields search. Cross-fields search is something new to
Elasticsearch whereby both fields are blended together and treated like a
single field. So the biasing above applies to the two fields together. If
you want to know more about cross-fields search, here's an article I
recently wrote:
http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/

If you want to actually have a bias towards a field with more matches, I'd
recommend a best_fields or most_fields search. Both take the full set of
search terms to each field, running a separate search per field. The
per-field scores are then combined (by adding them for most_fields, or by
taking the max score for best_fields).
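
As a rough illustration of the difference, the sketch below builds the two request bodies as Python dicts and mimics how the per-field scores get combined. The `multi_match` types `best_fields` and `most_fields` are real; the helper names and the per-field scores are invented for the example, and the combining function ignores details such as `tie_breaker`.

```python
import json
from typing import Dict, Sequence


def multi_match_body(query_text: str, fields: Sequence[str], match_type: str) -> Dict:
    """Build a multi_match request body of the given type."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "type": match_type,
                "fields": list(fields),
            }
        }
    }


def combine(per_field_scores: Sequence[float], match_type: str) -> float:
    """Toy combination of per-field scores: max for best_fields, sum for most_fields."""
    if match_type == "best_fields":
        return max(per_field_scores)
    if match_type == "most_fields":
        return sum(per_field_scores)
    raise ValueError("unsupported type: " + match_type)


best_fields_body = multi_match_body("john smith", ["name", "address"], "best_fields")
most_fields_body = multi_match_body("john smith", ["name", "address"], "most_fields")

# Made-up per-field scores. "James Smith" / "325 John Street" matches one term
# in each field; "John Smith Junior" matches both terms in the name field.
split_across_fields = [0.5, 0.5]
both_in_name = [2.0]

print(json.dumps(most_fields_body, indent=2))
for match_type in ("best_fields", "most_fields"):
    print(match_type,
          combine(split_across_fields, match_type),
          combine(both_in_name, match_type))
```

With these invented numbers the document that has both terms in one field wins under either type, which is the bias being asked for. Real scores depend on term statistics, so treat this only as a picture of the mechanism.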

Until I finish the related chapter in the search relevance book I'm
writing (shameless plug :-p http://manning.com/turnbull), the best place to
read about these topics is the docs or the online guide. In particular,
this appears relevant:
http://www.elastic.co/guide/en/elasticsearch/guide/master/multi-field-search.html

Hope that helps

On Mon, Apr 13, 2015 at 7:30 PM, Andre Dantas Rocha <
andre.dantas.rocha@gmail.com> wrote:

--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning
Publications
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Sorry for the confusing typo in "towards matches with fewer fields":
"fields" should be "search terms".

Hi Doug,

Thank you for your quick response and comprehensive explanation. It does
make sense.

We are using cross_fields (with the "and" operator) because we want to make
sure that the documents returned contain all the search terms somewhere.

For example, the search for "100 john smith" would return only one
document ("john smith" matches the name and "100" matches the address).

We expect no results for "200 john smith" as 200 appears nowhere.

But if we search for "john smith" we should get both documents back, and the
document with "john smith" should be the first one in the list (since the
terms "john" and "smith" match in the same field).

Is it possible to accomplish this with best_fields or most_fields?

Thanks again,

Andre

If you want this to be a minimum you could

This would eliminate any search results without a match somewhere, cutting
off the long tail as you need to, but score using most fields.

Does that make sense?
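
The query that went with this reply is not preserved in the thread, so the following is only one possible reading of it, not the original suggestion. It assumes a `bool` query whose `filter` clause is the cross_fields/"and" search (so every term must appear somewhere) and whose `must` clause is a most_fields search that does the scoring. The `filter` clause of `bool` needs Elasticsearch 2.0 or later; the helper names and the tiny in-memory check are purely illustrative.

```python
import json
from typing import Dict, List

FIELDS = ["name", "address"]


def filtered_most_fields_body(query_text: str) -> Dict:
    """Assumed shape: filter with cross_fields/and, score with most_fields."""
    return {
        "query": {
            "bool": {
                # Non-scoring: every term must appear in some field.
                "filter": {
                    "multi_match": {
                        "query": query_text,
                        "type": "cross_fields",
                        "operator": "and",
                        "fields": FIELDS,
                    }
                },
                # Scoring: each field is searched on its own, scores are added.
                "must": {
                    "multi_match": {
                        "query": query_text,
                        "type": "most_fields",
                        "fields": FIELDS,
                    }
                },
            }
        }
    }


# The two documents from the thread.
DOCS = [
    {"name": "James Smith", "address": "325 John Street"},
    {"name": "John Smith Junior", "address": "100 Baryl Street"},
]


def passes_filter(doc: Dict[str, str], query_text: str) -> bool:
    """Toy stand-in for the filter: all terms present across the fields."""
    tokens = {tok for field in FIELDS for tok in doc[field].lower().split()}
    return all(term in tokens for term in query_text.lower().split())


def surviving_names(query_text: str) -> List[str]:
    """Names of the documents that would get past the filter."""
    return [doc["name"] for doc in DOCS if passes_filter(doc, query_text)]


print(json.dumps(filtered_most_fields_body("john smith"), indent=2))
for text in ("100 john smith", "200 john smith", "john smith"):
    print(text, "->", surviving_names(text))
```

The in-memory check only mirrors the "all terms somewhere" requirement described earlier in the thread; ranking of the survivors would come from the most_fields clause on a real index.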

Hi Doug,

Yes, it does make sense. I'll try to rewrite it and get back to you.

Thank you again for your help,
Andre

Hi Doug,

Your suggestion worked perfectly!

Thank you very much.

Andre
