Refactoring a search

Nick_Hoffman · January 31, 2012, 7:44am

Hey guys. One of my queries is scoring documents strangely (IMO).
Obviously, this means that my query and/or mapping needs some work. Would
you mind giving me some advice on what type of query should be used here,
please?

The query is for my web app's generic "search" bar, and returns products
that match the search text.

Within each product document, I want to search through:

name
catalog.name
items.name
items.property_attribs.character.analyzed
_all

Is a dis_max query with field sub-queries ideal?

Here's a gist with more detail, because that usually helps:

gist.github.com

https://gist.github.com/nickhoffman/df6321bdd0b6b5d599f8

1_description.txt

See this mailing list post for details:
https://groups.google.com/forum/#!topic/elasticsearch/V7Ly0eH-qkM

The 1st document listed below has 1 occurrence of "Grimlock".
The 2nd document listed below has 7 occurrences of "Grimlock".

Despite this, the 1st document is scored higher than the 2nd document.

How can that be, considering that the 1st document matches on 1 field that's boosted by 4,
whereas the 2nd document matches on 1 field that's boosted by 4, and 6 fields that're boosted by 2?

This file has been truncated.

2_query.js

// This is the query that I'm trying to improve.

{
  "from":0,
  "size":10,
  "query":{
    "dis_max":{
      "queries":[
        { "field":{ "name":"Grimlock" } },
        { "field":{ "catalog.name":"Grimlock" } },

This file has been truncated.

3_the_mapping.js

// This is the product mapping.


{
  "product": {
    "dynamic_templates": [
      {
        "string_property": {
          "match_mapping_type": "string",
          "path_match"        : "*property_attribs.*",

This file has been truncated.

There are more than three files.

Thanks again for your advice.
Nick

kimchy · January 31, 2012, 5:41pm

Hard to tell what scores wrong. Did you try to boost the search on fields that are more important if they match?

Nick_Hoffman · January 31, 2012, 6:30pm

On Tuesday, 31 January 2012 12:41:35 UTC-5, kimchy wrote:

Hard to tell what scores wrong. Did you try to boost the search on fields
that are more important if they match?

Yeah, I'm boosting on the most important fields:

by 4 on "name"
by 2 on "items.name"
by 2 on "items.property_attribs.character.analyzed"

The 1st document in the gist has 1 occurrence of "Grimlock", while the 2nd
document has 7 occurrences of "Grimlock". Despite this, the 1st document is
scored higher than the 2nd document.

How can that be, considering that the 1st document matches on 1 field
that's boosted by 4, whereas the 2nd document matches on 1 field that's
boosted by 4, and 6 fields that're boosted by 2?

I just updated the gist with this info, and the query's explanation:

gist.github.com

https://gist.github.com/nickhoffman/df6321bdd0b6b5d599f8

Thanks again for your help with this, Shay. I've spent hours trying to
figure this out, but haven't made any progress.
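
A note on the question above: a dis_max query scores a document by its best-scoring sub-query, and the other matching sub-queries only contribute through tie_breaker, which defaults to 0. Matches on additional fields therefore do not add up by default. A minimal Python sketch of that combination rule follows; the sub-query scores are made up for illustration (the real ones are in the gist's explain output, which is truncated here), and `dis_max_score` is a hypothetical helper, not an Elasticsearch API.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine sub-query scores the way a dis_max query does:
    the best score plus tie_breaker times the sum of the others."""
    matching = [s for s in sub_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical scores, for illustration only.
one_field = [1.2]                    # a single matching field
seven_fields = [1.0] + [0.5] * 6     # seven matching fields, each a bit weaker

print(dis_max_score(one_field))            # only the best sub-query counts
print(dis_max_score(seven_fields))         # still only the best sub-query
print(dis_max_score(seven_fields, 0.3))    # the other matches now contribute
```

With a tie_breaker of 0, the document with seven weaker matches can rank below the document with one stronger match, which is consistent with the behaviour described in this post.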

kimchy · February 1, 2012, 10:01am

I did not see you boosting it in the query you sent; use it there.

Nick_Hoffman · February 1, 2012, 3:07pm

On Wednesday, 1 February 2012 05:01:57 UTC-5, kimchy wrote:

I did not see you boosting it in the query you sent, use it there...

The mapping already boosts the fields, though. If I boost them in the
query, too, wouldn't that apply the boost twice?

Jan_Fiedler · February 1, 2012, 4:13pm

I have not spent a lot of time on it, but glancing over the gist I noticed
the following: your mapping has a boost of 4 for the *analyzed* version of
the top-level name field. Your query runs on the not-analyzed version, so
the boost will never kick in. Is this intentional?

Your items.name is *always* analyzed via edge ngram (no separate analyzed
version). This generates many tokens that will match your user input
'Grimlock'. I would assume that these multiple hits on p1.items.name are
rated higher than the plain exact hit on p2.name (I am ignoring the other
fields matching for simplicity).
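
To illustrate the point about edge n-grams, here is a rough Python sketch of what happens when the same edge n-gram analysis is applied at both index time and search time. The analyzer's real settings are not visible in the truncated mapping, so `min_gram=1` and `max_gram=20` are assumptions, and `standard_tokens` is only a crude stand-in for a real analyzer.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of a token (assumed gram sizes)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def standard_tokens(text):
    """Crude stand-in for a standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


# Index time: every prefix of every token is stored as a term.
indexed = set()
for tok in standard_tokens("Grimlock"):
    indexed.update(edge_ngrams(tok))

# Search time, analyzed with edge n-grams as well: every prefix is a query term.
query_ngram = [g for tok in standard_tokens("Grimlock") for g in edge_ngrams(tok)]
# Search time, analyzed without n-grams: a single query term.
query_plain = standard_tokens("Grimlock")

print(len([t for t in query_ngram if t in indexed]))  # 8 matching terms
print(len([t for t in query_plain if t in indexed]))  # 1 matching term
```

Under these assumptions a single occurrence of the word produces eight term matches on the n-grammed field, against one on a plainly analyzed field.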

Nick_Hoffman · February 1, 2012, 9:55pm

Hey Jan. Thanks for your help.

I have not spent a lot of time on it, but glancing over the gist I noticed
the following: your mapping has a boost of 4 for the *analyzed* version
of the top-level name field. Your query runs on the not-analyzed version,
so the boost will never kick in. Is this intentional?

Good catch. That was not intentional. I've swapped that around, and updated
the gist.
Is this query ideal for a generic search? · GitHub

Your items.name is *always* analyzed via edge ngram (no separate analyzed
version). This generates many tokens that will match your user input
'Grimlock'. I would assume that these multiple hits on p1.items.name are
rated higher than the plain exact hit on p2.name (I am ignoring the other
fields matching for simplicity).

Ah, that makes sense. I've changed this to be:
"index_analyzer" : "ascii_edge_ngram",
"search_analyzer" : "ascii_std",

Now there are only 2 occurrences of "items.name" in the search's explanation,
but that's still one too many, right?

Also, I just noticed that the root-level "name" field isn't mentioned in
the search's explanation. Why would that be?

kimchy · February 5, 2012, 10:45am

Use boosting on the query side; index-time boosting is usually too restrictive.

Nick_Hoffman · February 5, 2012, 5:50pm

On Sunday, 5 February 2012 05:45:04 UTC-5, kimchy wrote:

Use boosting on the query side, indexing time boosting is usually too
restrictive.

So boost values should be specified in queries instead of in mappings?

What's the difference, exactly? I looked around for info on this, but
couldn't find anything.

Thanks, kimchy!

kimchy · February 5, 2012, 5:58pm

When you specify boost values at index time, they get stored in the index, but with reduced resolution. Most of the time it's much better, if possible, to provide them at query time, which gives you the flexibility of changing them on the fly without needing to reindex.
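
As a sketch of what query-side boosting could look like for the search discussed in this thread, the Python below builds a dis_max body in which each sub-query carries its own boost. The field names and the boosts of 4 and 2 come from earlier posts; the boost of 1.0 for catalog.name and _all, the use of text sub-queries in their extended form, and the helper names are assumptions made for illustration.

```python
import json


def boosted_text(field, text, boost):
    """One text sub-query in its extended form, carrying a query-time boost."""
    return {"text": {field: {"query": text, "boost": boost}}}


def build_search(text):
    # Boosts of 4 and 2 are the values mentioned in the thread;
    # 1.0 for the remaining fields is an assumed neutral value.
    boosts = {
        "name": 4.0,
        "items.name": 2.0,
        "items.property_attribs.character.analyzed": 2.0,
        "catalog.name": 1.0,
        "_all": 1.0,
    }
    sub_queries = [boosted_text(f, text, b) for f, b in boosts.items()]
    return {"from": 0, "size": 10, "query": {"dis_max": {"queries": sub_queries}}}


print(json.dumps(build_search("Grimlock"), indent=2))
```

Because the boosts live in the request body, changing them is a matter of editing the `boosts` dictionary and re-running the search; nothing has to be reindexed.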

dadoonet · February 5, 2012, 7:01pm

Very interesting!
I will change my code tomorrow to follow this good advice!

David
@dadoonet

Nick_Hoffman · February 5, 2012, 10:02pm

Interesting. Thanks for that insight, kimchy.

I'm boosting a field query inside of a dis_max query, but ES is bailing. Is
the "boost" option here not allowed?

curl -X DELETE 'localhost:9200/test?pretty=1'

curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Hoffman" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "John Smith" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Other" }'

curl -X POST 'localhost:9200/test/_refresh?pretty=1'

curl 'localhost:9200/test/foo/_search?pretty=1' -d '{
  "query":{
    "dis_max":{
      "queries":[
        {
          "field": { "name": "Nick", "boost": 4.0 }
        }
      ]
    }
  }
}'

kimchy · February 5, 2012, 11:21pm

For the field query, you need to send the second format, which allows for more options; see the second sample in the field query documentation. I suggest you use the text query though: it is a bit faster and has more options, unless you want to support the Lucene query syntax.
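
A sketch of the difference between the two formats, written in Python for convenience. It assumes the extended field query format of that Elasticsearch era nests the options under the field name next to a "query" key; `extended_field_query` is a hypothetical helper, not part of any client library.

```python
import json


def extended_field_query(field, query_string, boost=None):
    """Build a field query in the extended form, where options sit beside "query"."""
    options = {"query": query_string}
    if boost is not None:
        options["boost"] = boost
    return {"field": {field: options}}


# Short form with a stray option next to the field name (the failing request).
rejected = {"field": {"name": "Nick", "boost": 4.0}}

# Extended form: the boost belongs to the field's own options object.
accepted = extended_field_query("name", "Nick", boost=4.0)

search_body = {"query": {"dis_max": {"queries": [accepted]}}}
print(json.dumps(search_body))
```

In the short form, "boost" sits where a second field name would be expected, which is presumably why the request was rejected.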

Nick_Hoffman · February 6, 2012, 2:28am

Thanks again, kimchy. However, when I specify the "boost" option, the score
doesn't seem to change:

gist.github.com

https://gist.github.com/nickhoffman/51059b4774ec9d33249c

1_setup.txt

## In the 2 queries below, the documents have the same score, despite the 2nd query having a boost.

curl -X DELETE 'localhost:9200/test?pretty=1'

curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Hoffman" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "John Smith" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Other" }'

curl -X POST 'localhost:9200/test/_refresh?pretty=1'

2_query_with_no_boost.txt

curl 'localhost:9200/test/foo/_search?pretty=1&explain=0' -d '{
  "query":{
    "dis_max":{
      "queries":[
        { "text": { "name": { query: "Nick" } } }
      ]
    }
  }
}'

This file has been truncated.

3_query_with_boost.txt

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,

This file has been truncated.

cole · February 6, 2012, 11:37pm

Hi Nick and Shay,

I ran into the same issue with specifying boost values on text queries
a few days ago. As far as I can tell from digging through the mailing
list archives, the boost value should be respected, but I haven't dug
into the latest ES code to verify that. To unblock myself, I wrapped
the text query I wanted to boost with a custom_score query with an
associated boost value. Nick, I forked your previous gist to give a
simple custom_score example: Why does the "boost" not affect the score? · GitHub

Shay: I noticed the text query boost was at least showing up in the
"explain" output for some boost values when wrapped in a custom_score
query. The fifth entry
("5_queries_with_boost_and_custom_score_boost.txt") in the gist I
linked to shows two examples, the first of which doesn't have any sign
of the intended text query boost while the second at least has some
sign of it in the explain section.

Thanks,
Cole

kimchy · February 7, 2012, 11:19am

Yes, this happens because the idf is 1 and the boosting gets normalized. We have a simple "take the score and multiply it by X" query called "custom_boost_factor"; here is how you can use it: gist:1759189 · GitHub. But for some reason it's not documented on the site! I will fix it shortly.
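
The normalization effect can be reproduced with a simplified model of Lucene's classic TF-IDF scoring for a query with a single term clause. This is a sketch, not Elasticsearch code: the field norm of 0.625 is a hypothetical value, and `term_score` is an illustrative helper.

```python
import math


def term_score(tf, idf, field_norm, boost=1.0):
    """Simplified classic TF-IDF score for a query with exactly one term clause."""
    query_weight = idf * boost
    # With a single clause, the sum of squared weights is just this one weight squared.
    query_norm = 1.0 / math.sqrt(query_weight ** 2)
    normalized_weight = query_weight * query_norm   # the boost cancels out here
    field_weight = math.sqrt(tf) * idf * field_norm
    return normalized_weight * field_weight


plain = term_score(tf=1, idf=1.0, field_norm=0.625)
boosted = term_score(tf=1, idf=1.0, field_norm=0.625, boost=4.0)

print(plain, boosted)   # identical scores: the boost is normalized away
print(4.0 * plain)      # what a plain "multiply the score by X" wrapper would give
```

When the boosted clause is the only clause, the query norm divides by exactly the weight that the boost multiplied in, so the score does not change. A boost only shifts scores relative to other clauses in the same query.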

cole · February 7, 2012, 6:47pm

That's great. Thanks, Shay. The custom_boost_factor query looks much
better than the custom_score query I was using with "script" :
"_score". =)

-cole
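
For reference, here is a Python sketch of the two wrappers mentioned in these last posts. The linked gists are not reproduced in this thread, so the exact request shapes are assumptions: custom_boost_factor is assumed to take "query" and "boost_factor", and the custom_score variant follows Cole's description of a "_score" script plus a boost. Both helper functions are hypothetical.

```python
import json


def custom_boost_factor(inner_query, factor):
    """Wrap a query so that its score is multiplied by a fixed factor."""
    return {"custom_boost_factor": {"query": inner_query, "boost_factor": factor}}


def custom_score_passthrough(inner_query, boost):
    """Earlier workaround: custom_score with the script "_score" and a boost."""
    return {"custom_score": {"query": inner_query, "script": "_score", "boost": boost}}


text_query = {"text": {"name": {"query": "Nick"}}}

search_body = {
    "query": {"dis_max": {"queries": [custom_boost_factor(text_query, 4.0)]}}
}
print(json.dumps(search_body, indent=2))
```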
