Hey guys. One of my queries is scoring documents strangely (IMO).
Obviously, this means that my query and/or mapping needs some work. Would
you mind giving me some advice on what type of query should be used here,
please?
The query is for my web app's generic "search" bar, and returns products
that match the search text.
Within each product document, I want to search through:
name
catalog.name
items.name
items.property_attribs.character.analyzed
_all
Is a dis_max query with field sub-queries ideal?
Here's a gist with more detail, because that usually helps:
Hard to tell what is scoring wrong. Did you try boosting the search on fields that are more important if they match?
On Tuesday, 31 January 2012 12:41:35 UTC-5, kimchy wrote:
Hard to tell what is scoring wrong. Did you try boosting the search on fields
that are more important if they match?
Yeah, I'm boosting on the most important fields:
by 4 on "name"
by 2 on "items.name"
by 2 on "items.property_attribs.character.analyzed"
The 1st document in the gist has 1 occurrence of "Grimlock", while the 2nd
document has 7 occurrences of "Grimlock". Despite this, the 1st document is
scored higher than the 2nd document.
How can that be, considering that the 1st document matches on 1 field
that's boosted by 4, whereas the 2nd document matches on 1 field that's
boosted by 4, and 6 fields that're boosted by 2?
I just updated the gist with this info, and the query's explanation:
Thanks again for your help with this, Shay. I've spent hours trying to
figure this out, but haven't made any progress.
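The question of how one match can outscore seven can be made concrete if the query is a dis_max, as proposed earlier in the thread. A dis_max query scores a document as its best sub-query score plus tie_breaker times the sum of the remaining sub-query scores, so with the default tie_breaker of 0 the extra matching fields add nothing. The sketch below uses made-up per-field scores and treats every match as its own sub-query, which is a simplification; `dis_max_score` is a hypothetical helper for illustration, not an Elasticsearch API.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine sub-query scores the way a dis_max query does:
    the best score plus tie_breaker times the sum of the others."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)


# Hypothetical per-field scores, chosen only to mirror the boosts in the thread.
doc1 = [4.0]               # one match on a field boosted by 4
doc2 = [4.0] + [2.0] * 6   # the same match plus six matches boosted by 2

# With tie_breaker = 0 both documents get the same score.
print(dis_max_score(doc1), dis_max_score(doc2))

# A non-zero tie_breaker lets the additional matches contribute.
print(dis_max_score(doc1, 0.3), dis_max_score(doc2, 0.3))
```

Under these assumptions the seven matches only help once tie_breaker is raised above 0, so the final ordering is decided by other factors such as field length and term frequency.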
I did not see you boosting it in the query you sent; use it there.
I have not spent a lot of time on it, but glancing over the gist I noticed
the following: your mapping has a boost of 4 for the *analyzed* version of
the top-level name field. Your query runs on the not-analyzed version, so
the boost will never kick in. Is this intentional?
Your items.name is *always* analyzed via edge ngram (no separate analyzed
version). This generates many tokens that will match your user input
'Grimlock'. I would assume that these multiple hits on p1.items.name are
rated higher than the plain exact hit on p2.name (I am ignoring the other
fields matching for simplicity).
Ah, that makes sense. I've changed this to be:
"index_analyzer" : "ascii_edge_ngram",
"search_analyzer" : "ascii_std",
Now there are only 2 occurrences of "items.name" in the search's explanation,
but that's still one too many, right?
Also, I just noticed that the root-level "name" field isn't mentioned in
the search's explanation. Why would that be?
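Jan's point about edge ngrams can be illustrated with a small simulation. It compares how many indexed terms a search for "Grimlock" hits when the search text is also run through an edge ngram analyzer versus a plain analyzer. The analyzers here are simplified stand-ins (lowercase plus whitespace split), and the min_gram and max_gram values are assumptions, since the real "ascii_edge_ngram" and "ascii_std" settings live in the gist.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of a token, shortest first."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze_edge_ngram(text, min_gram=1, max_gram=20):
    """Stand-in for an edge ngram analyzer: lowercase, split, emit prefixes."""
    grams = []
    for word in text.lower().split():
        grams.extend(edge_ngrams(word, min_gram, max_gram))
    return grams


def analyze_standard(text):
    """Stand-in for a plain analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Terms stored in the index for a field analyzed with edge ngrams.
indexed = set(analyze_edge_ngram("Grimlock"))

# Search text analyzed with edge ngrams: every prefix becomes a query term.
matches_ngram = [t for t in analyze_edge_ngram("Grimlock") if t in indexed]

# Search text analyzed with the plain analyzer: a single query term.
matches_std = [t for t in analyze_standard("Grimlock") if t in indexed]

print(len(matches_ngram), len(matches_std))
```

Each matching query term adds to the score, which is why using the edge ngram analyzer only at index time cuts down the number of hits per field.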
Use boosting on the query side; index-time boosting is usually too restrictive.
On Sunday, February 5, 2012 at 7:50 PM, Nick Hoffman wrote:
On Sunday, 5 February 2012 05:45:04 UTC-5, kimchy wrote:
Use boosting on the query side, indexing time boosting is usually too restrictive.
So boost values should be specified in queries instead of in mappings?
What's the difference, exactly? I looked around for info on this, but couldn't find anything.
When you specify boost values at index time, they get stored in the index, but with reduced resolution. Most of the time it's much better, if possible, to provide them at query time, which gives you the flexibility of changing them on the fly without needing to reindex.
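As a rough sketch of what query-time boosting could look like for this search bar, the snippet below builds a dis_max request body with one boosted text query per field. The boosts of 4 and 2 are the ones mentioned earlier in the thread; the 1.0 values for catalog.name and _all, the helper names, and the tie_breaker default are assumptions for illustration. Changing a boost then means editing a dictionary, not reindexing.

```python
import json

# Boosts for name, items.name and the character field come from the thread;
# the 1.0 values for catalog.name and _all are assumptions.
FIELD_BOOSTS = {
    "name": 4.0,
    "catalog.name": 1.0,
    "items.name": 2.0,
    "items.property_attribs.character.analyzed": 2.0,
    "_all": 1.0,
}


def boosted_text_query(field, text, boost):
    """Build one text sub-query with its boost supplied at query time."""
    return {"text": {field: {"query": text, "boost": boost}}}


def build_search_body(text, field_boosts, tie_breaker=0.0):
    """Wrap one boosted text query per field in a dis_max query."""
    queries = [boosted_text_query(f, text, b) for f, b in field_boosts.items()]
    return {"query": {"dis_max": {"tie_breaker": tie_breaker, "queries": queries}}}


body = build_search_body("Grimlock", FIELD_BOOSTS)
print(json.dumps(body, indent=2))
```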
For the field query, you need to send the second format, which allows for more options; see the second sample in the field query documentation. I suggest you use the text query though: it is a bit faster and has more options, unless you want to support the Lucene query syntax.
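To make the two formats concrete, here is a minimal sketch that builds the short and the extended form of a field query, plus the equivalent text query, as request bodies. The search term "nick" and the helper names are assumptions chosen to match the test documents quoted below; only the extended form has room for a "boost" option.

```python
import json


def field_query(field, text, boost=None):
    """Return a field query: short form without a boost, extended form with one."""
    if boost is None:
        return {"field": {field: text}}
    return {"field": {field: {"query": text, "boost": boost}}}


def text_query(field, text, boost=None):
    """Return a text query, which does not parse the Lucene query syntax."""
    options = {"query": text}
    if boost is not None:
        options["boost"] = boost
    return {"text": {field: options}}


# A dis_max query holding one boosted field query in the extended format.
body = {"query": {"dis_max": {"queries": [field_query("name", "nick", boost=2.0)]}}}
print(json.dumps(body))
```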
On Monday, February 6, 2012 at 12:02 AM, Nick Hoffman wrote:
Interesting. Thanks for that insight, kimchy.
I'm boosting a field query inside of a dis_max query, but ES is bailing. Is the "boost" option here not allowed?
curl -X DELETE 'localhost:9200/test?pretty=1'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Hoffman" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "John Smith" }'
curl -X POST 'localhost:9200/test/foo?pretty=1' -d '{ name: "Nick Other" }'
curl -X POST 'localhost:9200/test/_refresh?pretty=1'
I ran into the same issue with specifying boost values on text queries
a few days ago. As far as I can tell from digging through the mailing
list archives, the boost value should be respected, but I haven't dug
into the latest ES code to verify that. To unblock myself, I wrapped
the text query I wanted to boost with a custom_score query with an
associated boost value. Nick, I forked your previous gist to give a
simple custom_score example: Why does the "boost" not affect the score? · GitHub
Shay: I noticed the text query boost was at least showing up in the
"explain" output for some boost values when wrapped in a custom_score
query. The fifth entry
("5_queries_with_boost_and_custom_score_boost.txt") in the gist I
linked to shows two examples, the first of which doesn't have any sign
of the intended text query boost while the second at least has some
sign of it in the explain section.
Yes, this happens because the idf is 1 and the boosting gets normalized. We have a simple "take the score and multiply it by X" query called "custom_boost_factor"; here is how you can use it: gist:1759189 · GitHub. But for some reason it's not documented on the site! I will fix it shortly.
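The normalization effect can be sketched with a simplified model of Lucene's classic TF-IDF scoring for a single term query: the boost enters the query weight, but the query norm divides it straight back out. The function below is that simplified model, not the actual Lucene code, and its defaults (idf, tf and field norm of 1) are assumptions. The second helper builds a custom_boost_factor request body; the contents of the linked gist are not reproduced here, so the wrapped query and the factor are placeholders.

```python
import json
import math


def single_term_score(boost, idf=1.0, tf=1.0, field_norm=1.0):
    """Simplified classic TF-IDF score for one term query, with query norm."""
    query_weight = idf * boost
    # The query norm is 1 / sqrt(sum of squared weights); with a single
    # clause that sum is just query_weight squared, so the boost cancels.
    query_norm = 1.0 / math.sqrt(query_weight ** 2)
    return (query_weight * query_norm) * (tf * idf * field_norm)


def custom_boost_factor(query, factor):
    """Wrap a query so that its score is multiplied by factor."""
    return {"custom_boost_factor": {"query": query, "boost_factor": factor}}


# The boost has no effect on a lone term query in this model.
print(single_term_score(1.0), single_term_score(4.0))

# Placeholder query and factor, for illustration only.
wrapped = custom_boost_factor({"text": {"name": "nick"}}, 4.0)
print(json.dumps({"query": wrapped}))
```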