Refactoring a search


(Nick Hoffman) #1

Hey guys. One of my queries is scoring documents strangely (IMO).
Obviously, this means that my query and/or mapping needs some work. Would
you mind giving me some advice on what type of query should be used here,
please?

The query is for my web app's generic "search" bar, and returns products
that match the search text.

Within each product document, I want to search through:

  • name
  • catalog.name
  • items.name
  • items.property_attribs.character.analyzed
  • _all

Is a dis_max query with field sub-queries ideal?
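
For reference, a minimal sketch of what a dis_max query over those five fields could look like, written as a Python dict that serializes to the JSON request body. It uses the short form of the field query that appears later in this thread. The helper name `build_dis_max_query` and the `tie_breaker` default are assumptions for illustration and are not taken from the gist.

```python
import json

# Fields named in the post above.
SEARCH_FIELDS = (
    "name",
    "catalog.name",
    "items.name",
    "items.property_attribs.character.analyzed",
    "_all",
)


def build_dis_max_query(search_text, fields=SEARCH_FIELDS, tie_breaker=0.0):
    """Build a search body with one short-form "field" sub-query per field."""
    sub_queries = [{"field": {field: search_text}} for field in fields]
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": sub_queries,
            }
        }
    }


body = build_dis_max_query("Grimlock")
print(json.dumps(body, indent=2))
```

The printed JSON is what would go in the `-d` body of a `_search` request.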

Here's a gist with more detail, because that usually helps:
https://gist.github.com/df6321bdd0b6b5d599f8

Thanks again for your advice.
Nick


(Shay Banon) #2

Hard to tell what scores wrong. Did you try boosting the search on fields that are more important if they match?


(Nick Hoffman) #3

On Tuesday, 31 January 2012 12:41:35 UTC-5, kimchy wrote:

> Hard to tell what scores wrong. Did you try boosting the search on fields
> that are more important if they match?

Yeah, I'm boosting on the most important fields:

  • by 4 on "name"
  • by 2 on "items.name"
  • by 2 on "items.property_attribs.character.analyzed"

The 1st document in the gist has 1 occurrence of "Grimlock", while the 2nd
document has 7 occurrences of "Grimlock". Despite this, the 1st document is
scored higher than the 2nd document.

How can that be, considering that the 1st document matches on 1 field
that's boosted by 4, whereas the 2nd document matches on 1 field that's
boosted by 4, and 6 fields that're boosted by 2?
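
One factor that may be relevant to this question: a dis_max query does not add up the scores of its sub-queries. It takes the best sub-query score and adds `tie_breaker` times the sum of the others, so extra matching fields contribute nothing when `tie_breaker` is 0. The sketch below works through that arithmetic. The per-field scores are made up for illustration and are not the values from the gist's explanation.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine sub-query scores the way a dis_max query does."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    others = sum(sub_scores) - best
    return best + tie_breaker * others


# Hypothetical scores: one strong match versus one slightly weaker match
# plus six minor ones.
doc_one = [4.0]
doc_two = [3.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

for tie_breaker in (0.0, 0.5):
    print(
        tie_breaker,
        dis_max_score(doc_one, tie_breaker),
        dis_max_score(doc_two, tie_breaker),
    )
```

With these numbers the first document wins at `tie_breaker` 0 and loses at 0.5, which shows how much the setting can change the ranking.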

I just updated the gist with this info, and the query's explanation:

Thanks again for your help with this, Shay. I've spent hours trying to
figure this out, but haven't made any progress.


(Shay Banon) #4

I did not see you boosting it in the query you sent. Use it there.


(Nick Hoffman) #5

On Wednesday, 1 February 2012 05:01:57 UTC-5, kimchy wrote:

> I did not see you boosting it in the query you sent. Use it there.

The mapping already boosts the fields, though. If I boost them in the
query, too, wouldn't that apply the boost twice?


(Jan Fiedler) #6

I have not spent a lot of time on it, but glancing over the gist I noticed
the following: your mapping has a boost of 4 for the *analyzed* version of
the top-level name field. Your query runs on the not-analyzed version, so
the boost will never kick in. Is this intentional?

Your items.name is *always* analyzed via edge ngram (no separate analyzed
version). This generates many tokens that will match your user input
'Grimlock'. I would assume that these multiple hits on p1.items.name are
rated higher than the plain exact hit on p2.name (I am ignoring the other
fields matching for simplicity).
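
To make the point about edge n-grams concrete, here is a plain-Python imitation of what a front edge n-gram analyzer does to a term. It is only a sketch: the `min_gram` and `max_gram` defaults are assumptions, and the real analyzer in the gist may be configured differently.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Return the front edge n-grams of a lowercased term."""
    term = term.lower()
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]


# Tokens stored in the index for a field value of "Grimlock".
indexed = set(edge_ngrams("Grimlock"))

# Search text analyzed with edge n-grams too: every prefix is a query term.
ngram_query_terms = edge_ngrams("Grimlock")
# Search text analyzed without n-grams: only the whole word is a query term.
plain_query_terms = ["grimlock"]

ngram_hits = [term for term in ngram_query_terms if term in indexed]
plain_hits = [term for term in plain_query_terms if term in indexed]

print(len(ngram_hits), ngram_hits)
print(len(plain_hits), plain_hits)
```

When the search text is n-grammed as well, each prefix counts as its own matching term, which is how a single field can pile up many hits for one word.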


(Nick Hoffman) #7

Hey Jan. Thanks for your help.

> I have not spent a lot of time on it, but glancing over the gist I noticed
> the following: your mapping has a boost of 4 for the *analyzed* version
> of the top-level name field. Your query runs on the not-analyzed version,
> so the boost will never kick in. Is this intentional?

Good catch. That was not intentional. I've swapped that around, and updated
the gist.
https://gist.github.com/df6321bdd0b6b5d599f8

> Your items.name is *always* analyzed via edge ngram (no separate analyzed
> version). This generates many tokens that will match your user input
> 'Grimlock'. I would assume that these multiple hits on p1.items.name are
> rated higher than the plain exact hit on p2.name (I am ignoring the other
> fields matching for simplicity).

Ah, that makes sense. I've changed this to be:
"index_analyzer" : "ascii_edge_ngram",
"search_analyzer" : "ascii_std",

Now there are only 2 occurrences of "items.name" in the search's explanation,
but that's still one too many, right?
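
For readers without access to the gist, the sketch below shows roughly how such a field could be set up in the Elasticsearch versions current at the time of this thread. Only the two analyzer names and the `index_analyzer` / `search_analyzer` keys come from the post. The analyzer definitions, the filter name `front_edge_ngram`, the gram sizes and the `product` type name are guesses made for illustration.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name and gram sizes.
                "front_edge_ngram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 20,
                    "side": "front",
                }
            },
            "analyzer": {
                # Assumed definition: ASCII folding plus edge n-grams.
                "ascii_edge_ngram": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "front_edge_ngram"],
                },
                # Assumed definition: same chain without the n-gram step.
                "ascii_std": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                },
            },
        }
    },
    "mappings": {
        "product": {  # hypothetical type name
            "properties": {
                "items": {
                    "properties": {
                        "name": {
                            "type": "string",
                            # N-grams are produced only when indexing.
                            "index_analyzer": "ascii_edge_ngram",
                            # Search text is left as whole words.
                            "search_analyzer": "ascii_std",
                        }
                    }
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```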

Also, I just noticed that the root-level "name" field isn't mentioned in
the search's explanation. Why would that be?


(Shay Banon) #8

Use boosting on the query side; index-time boosting is usually too restrictive.


(Nick Hoffman) #9

On Sunday, 5 February 2012 05:45:04 UTC-5, kimchy wrote:

> Use boosting on the query side; index-time boosting is usually too
> restrictive.

So boost values should be specified in queries instead of in mappings?

What's the difference, exactly? I looked around for info on this, but
couldn't find anything.

Thanks, kimchy!


(Shay Banon) #10

When you specify boost values at index time, they get stored in the index, but with reduced resolution. Most of the time it's much better, if possible, to provide them at query time, which gives you the flexibility of changing them on the fly without needing to reindex.
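
A small sketch of the query-time approach: the boosts live in application code and are written into each sub-query, so trying a different weighting only means sending a different request. The boost values are the ones Nick listed earlier in the thread. The helper name `boosted_dis_max` is made up for illustration, and the sub-queries use the text query that Shay recommends further down.

```python
import json


def boosted_dis_max(search_text, boosts):
    """Build a dis_max body with one boosted "text" sub-query per field."""
    sub_queries = [
        {"text": {field: {"query": search_text, "boost": boost}}}
        for field, boost in boosts.items()
    ]
    return {"query": {"dis_max": {"queries": sub_queries}}}


# Boosts mentioned earlier in the thread.
current_boosts = {
    "name": 4.0,
    "items.name": 2.0,
    "items.property_attribs.character.analyzed": 2.0,
}

# Trying a different weighting needs no reindexing, only a new request body.
experimental_boosts = dict(current_boosts, name=8.0)

print(json.dumps(boosted_dis_max("Grimlock", current_boosts), indent=2))
print(json.dumps(boosted_dis_max("Grimlock", experimental_boosts), indent=2))
```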


(David Pilato) #11

Very interesting!
I will change my code tomorrow with this good advice!

David ;-)
@dadoonet


(Nick Hoffman) #12

Interesting. Thanks for that insight, kimchy.

I'm boosting a field query inside of a dis_max query, but ES is bailing. Is
the "boost" option here not allowed?

curl -X DELETE 'localhost:9200/test?pretty=1'

curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ "name": "Nick Hoffman" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ "name": "John Smith" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ "name": "Nick Other" }'

curl -X POST 'localhost:9200/test/_refresh?pretty=1'

# The search below is the one ES rejects: "boost" sits next to the field name
# in the short form of the field query.
curl 'localhost:9200/test/foo/_search?pretty=1' -d '{
  "query": {
    "dis_max": {
      "queries": [
        {
          "field": { "name": "Nick", "boost": 4.0 }
        }
      ]
    }
  }
}'


(Shay Banon) #13

For the field query, you need to send the second format, which allows for more options; see the second sample here: http://www.elasticsearch.org/guide/reference/query-dsl/field-query.html. I suggest you use the text query though: it is a bit faster and has more options, unless you want to support the Lucene query syntax.
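
As a sketch of the difference, the three shapes below are written as Python dicts that serialize to the JSON sub-query. In the extended format the field name maps to an object, which gives `boost` a place to live. The helper `wrap_in_dis_max` is a made-up name used only to rebuild the request from the previous post.

```python
import json

# Short form: the field name maps directly to the query string.
short_form = {"field": {"name": "Nick"}}

# Extended form: the field name maps to an object with the query and options.
extended_form = {"field": {"name": {"query": "Nick", "boost": 4.0}}}

# Text query alternative, using the same nesting for its options.
text_form = {"text": {"name": {"query": "Nick", "boost": 4.0}}}


def wrap_in_dis_max(*sub_queries):
    """Place the given sub-queries inside a dis_max search body."""
    return {"query": {"dis_max": {"queries": list(sub_queries)}}}


print(json.dumps(wrap_in_dis_max(extended_form), indent=2))
print(json.dumps(wrap_in_dis_max(text_form), indent=2))
```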


(Nick Hoffman) #14

Thanks again, kimchy. However, when I specify the "boost" option, the score
doesn't seem to change:
https://gist.github.com/51059b4774ec9d33249c


(cole) #15

Hi Nick and Shay,

I ran into the same issue with specifying boost values on text queries
a few days ago. As far as I can tell from digging through the mailing
list archives, the boost value should be respected, but I haven't dug
into the latest ES code to verify that. To unblock myself, I wrapped
the text query I wanted to boost with a custom_score query with an
associated boost value. Nick, I forked your previous gist to give a
simple custom_score example: https://gist.github.com/4a446763fa0c3d6c2fb0

Shay: I noticed the text query boost was at least showing up in the
"explain" output for some boost values when wrapped in a custom_score
query. The fifth entry
("5_queries_with_boost_and_custom_score_boost.txt") in the gist I
linked to shows two examples, the first of which doesn't have any sign
of the intended text query boost while the second at least has some
sign of it in the explain section.
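
Cole's exact query is in the linked gist and is not reproduced here. The sketch below only shows the general shape of wrapping a text query in a custom_score query, using the `query`, `params` and `script` keys. Note that it multiplies inside the script instead of using a separate boost value, which is a substitution made for this example; the helper name `wrap_in_custom_score` and the parameter name `factor` are likewise made up.

```python
import json


def wrap_in_custom_score(query, factor):
    """Wrap a query so that its score is multiplied by factor in a script."""
    return {
        "custom_score": {
            "query": query,
            "params": {"factor": factor},
            # _score is the score produced by the wrapped query.
            "script": "_score * factor",
        }
    }


text_query = {"text": {"name": {"query": "Nick"}}}
body = {"query": wrap_in_custom_score(text_query, 4.0)}
print(json.dumps(body, indent=2))
```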

Thanks,
Cole

On Feb 5, 6:28 pm, Nick Hoffman n...@deadorange.com wrote:

> Thanks again, kimchy. However, when I specify the "boost" option, the score
> doesn't seem to change: https://gist.github.com/51059b4774ec9d33249c


(Shay Banon) #16

Yes, this happens because the idf is 1 and the boosting gets normalized. We have a simple "take the score and multiply it by X" query called "custom_boost_factor"; here is how you can use it: https://gist.github.com/1759189. But for some reason it's not documented on the site! I will fix it shortly.
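
A rough illustration of both halves of that answer. The first function assumes classic Lucene TF-IDF scoring, where the query norm is 1 / sqrt(sum of squared weights); for a query with a single term, the boost then cancels out of the query-side weight. The second part shows the shape of a custom_boost_factor query with its `query` and `boost_factor` keys. The function names are made up for this sketch, and the contents of the linked gist are not reproduced.

```python
import json
import math


def single_term_query_weight(idf, boost):
    """Query-side weight of a lone term after query normalization."""
    raw_weight = idf * boost
    query_norm = 1.0 / math.sqrt(raw_weight ** 2)
    return raw_weight * query_norm


# The boost is divided away again by the normalization step.
print(single_term_query_weight(idf=1.0, boost=1.0))
print(single_term_query_weight(idf=1.0, boost=4.0))


def with_boost_factor(query, factor):
    """Wrap a query so that its final score is multiplied by factor."""
    return {"custom_boost_factor": {"query": query, "boost_factor": factor}}


body = {"query": with_boost_factor({"text": {"name": {"query": "Nick"}}}, 4.0)}
print(json.dumps(body, indent=2))
```

Because the multiplication happens on the finished score, it is not affected by the normalization that swallowed the plain `boost`.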


(cole) #17

That's great. Thanks, Shay. The custom_boost_factor query looks much
better than the custom_score query I was using with "script" :
"_score". =)

-cole
