Boost, scoring and phrase search


(Yannick Smits-2) #1
  •      Can we have query time index boost levels?

  •      How would you implement formatting-based boosting? For instance,
         having bold, italic and uppercase text (HTML) score higher?

  •      I noticed phrase searches actually don't limit the search results
         to matches within 1 phrase. So when searching for "fear of dark" it
         will also match if fear and dark appear in different sentences
         ("Then it became dark. He had fear of cats."). Is there a way to
         force the result to be in the same sentence?

  •      Is proximity of words also taken into account when doing a
         multi-word phrase search (the closer the words appear, the higher
         the score)?

Thanks,
Yannick


(Yannick Smits-2) #2

Could somebody help me with my Monday questions? Even if you only know the
answer to 1 of them it would be very helpful.

Thanks,
Yannick



(Clinton Gormley) #3

Hi Yannick

On Mon, 2011-06-20 at 23:15 +0200, Yannick Smits wrote:

  •      Can we have query time index boost levels?
    

Yes. Look at the Query DSL docs - most queries take a 'boost'
parameter.

  •      How would you implement formatting-based boosting? For instance,
         having bold, italic and uppercase text (HTML) score higher?

I don't think you can (at least not currently)

  •      I noticed phrase searches actually don’t limit the search results
         to matches within 1 phrase. So when searching for “fear of dark” it
         will also match if fear and dark appear in different sentences
         (“Then it became dark. He had fear of cats.”). Is there a way to
         force the result to be in the same sentence?

I think you mean a 'text' query, not a 'text_phrase' query. A text
query will match text that contains the same words. A text_phrase query
will find text that includes exactly the same phrase (ignoring stop
words like 'of').

You can set the slop factor for text_phrase queries so that the words
don't have to be right next to each other.

  •      Is proximity of words also taken into account when doing a
         multi-word phrase search (the closer the words appear, the higher
         the score)?

With the text_phrase query, and slop, yes: proximity is taken into
account.
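
As a rough illustration of the slop setting mentioned above, here is a small Python sketch that only builds the request body for a text_phrase search. The field name "text", the helper name phrase_query and the slop value are made up for the example, and the object form with "query" and "slop" keys is an assumption about the Query DSL of this era, so check it against the docs for your version (later releases renamed this query to match_phrase).

```python
import json


def phrase_query(field, phrase, slop=0):
    """Build a search body for a text_phrase query.

    slop=0 asks for the words right next to each other; a larger slop
    lets the words sit further apart and still match.
    """
    return {
        "query": {
            "text_phrase": {
                # Assumed object form: {"query": ..., "slop": ...}
                field: {"query": phrase, "slop": slop}
            }
        }
    }


body = phrase_query("text", "fear of dark", slop=2)
print(json.dumps(body, indent=2))
```

The printed JSON could be passed to curl with -d in the same way as the other searches in this thread.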

clint


(Yannick Smits-2) #4

Hi Clinton,

  1. I had a look at the docs but could not find a way to specify a boost value based on indices the documents are in, at query time, only as a configuration/static (http://www.elasticsearch.org/guide/reference/api/search/index-boost.html). What am I missing?

  2. could you think of a strategy to simulate such a behavior? Like extracting the bold words and saving them with a higher boost to a different field or something without screwing up the highlighting feature?

  3. yes, I'm using text_phrase. But still I would like to know if it is possible to have it look only within the phrase for the specified terms instead of matching over multiple phrases.

Thanks,
Yannick



(Clinton Gormley) #5

Hi Yannick

  1. I had a look at the docs but could not find a way to specify a
    boost value based on indices the documents are in, at query time, only
    as a configuration/static
    (http://www.elasticsearch.org/guide/reference/api/search/index-boost.html). What am I missing?

The page you link to above does not say that indices_boost is static -
it is a parameter that you can pass to any search query.

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
   "query" : {
      "match_all" : {}
   },
   "indices_boost" : {
      "index_foo" : 10,
      "index_bar" : 1
   }
}
'

  2. could you think of a strategy to simulate such a behavior? Like
    extracting the bold words and saving them with a higher boost to a
    different field or something without screwing up the highlighting
    feature?

This would require some work on the client side. You could have two
fields: 'text' and 'important_text'. 'text' would contain all of the
text, and 'important_text' just the bits inside the tags.
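
The client-side split described above might look something like the following Python sketch, using only the standard library. The set of tags treated as important and the names ImportantTextExtractor and split_fields are assumptions for illustration; uppercase text is not handled here and would need its own rule.

```python
from html.parser import HTMLParser

# Tags whose contents count as "important"; this set is an assumption.
IMPORTANT_TAGS = {"b", "strong", "i", "em"}


class ImportantTextExtractor(HTMLParser):
    """Collect all text, plus the text found inside important tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0              # number of important tags currently open
        self.all_parts = []
        self.important_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.all_parts.append(data)
        if self.depth > 0:
            self.important_parts.append(data)


def split_fields(html):
    """Return a document with 'text' and 'important_text' fields."""
    parser = ImportantTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the fields are clean to index.
    text = " ".join("".join(parser.all_parts).split())
    important = " ".join(" ".join(parser.important_parts).split())
    return {"text": text, "important_text": important}


doc = split_fields("<p>He had <b>fear</b> of the <i>dark</i>.</p>")
print(doc)
```

The resulting dictionary is what would be sent as the document body when indexing.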

Then you can use a bool query to boost anything found in important_text,
but only do the highlighting on the 'text' field.
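
A possible shape for that search, again as a Python sketch that only builds the request body. The boost value of 2.0, the helper name boosted_search and the object form of the text query with "query" and "boost" keys are assumptions, not something confirmed in this thread.

```python
import json


def boosted_search(terms, boost=2.0):
    """Match on 'text', score higher when 'important_text' also matches."""
    return {
        "query": {
            "bool": {
                # Every hit has to match the full text.
                "must": [{"text": {"text": terms}}],
                # Optional clause: a match here only raises the score.
                "should": [
                    {"text": {"important_text": {"query": terms, "boost": boost}}}
                ],
            }
        },
        # Highlight only the full text field, not the extracted copy.
        "highlight": {"fields": {"text": {}}},
    }


search_body = boosted_search("fear dark")
print(json.dumps(search_body, indent=2))
```

Because the important_text clause sits under "should", documents without any bold or italic matches are still returned, just with a lower score.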

  3. yes, I'm using text_phrase. But still I would like to know if it is
    possible to have it look only within the phrase for the specified
    terms instead of matching over multiple phrases.

ES has no concept of sentences. I thought of possibly breaking up the
content into individual sentences, eg:

[ 'The quick brown fox', 'jumped over the lazy dog']

but it looks like ES just concatenates these values anyway:

[Wed Jun 22 16:22:35 2011] Protocol: http, Server: 192.168.5.103:9200

curl -XPOST 'http://127.0.0.1:9200/foo/bar?pretty=1' -d '
{
   "text" : [
      "The quick brown fox",
      "jumped over the lazy dog"
   ]
}
'

[Wed Jun 22 16:22:35 2011] Response:

{
   "ok" : true,
   "_index" : "foo",
   "_id" : "-Dt7zDUCQKauV69L_32w9g",
   "_type" : "bar",
   "_version" : 1
}

[Wed Jun 22 16:22:49 2011] Protocol: http, Server: 192.168.5.103:9200

curl -XGET 'http://127.0.0.1:9200/foo/_search?pretty=1' -d '
{
   "query" : {
      "text_phrase" : {
         "text" : "brown jumped"
      }
   }
}
'

[Wed Jun 22 16:22:49 2011] Response:

{
   "hits" : {
      "hits" : [],
      "max_score" : null,
      "total" : 0
   },
   "timed_out" : false,
   "_shards" : {
      "failed" : 0,
      "successful" : 5,
      "total" : 5
   },
   "took" : 2
}

[Wed Jun 22 16:22:54 2011] Protocol: http, Server: 192.168.5.103:9200

curl -XGET 'http://127.0.0.1:9200/foo/_search?pretty=1' -d '
{
   "query" : {
      "text_phrase" : {
         "text" : "fox jumped"
      }
   }
}
'

[Wed Jun 22 16:22:54 2011] Response:

{
   "hits" : {
      "hits" : [
         {
            "_source" : {
               "text" : [
                  "The quick brown fox",
                  "jumped over the lazy dog"
               ]
            },
            "_score" : 0.23013961,
            "_index" : "foo",
            "_id" : "-Dt7zDUCQKauV69L_32w9g",
            "_type" : "bar"
         }
      ],
      "max_score" : 0.23013961,
      "total" : 1
   },
   "timed_out" : false,
   "_shards" : {
      "failed" : 0,
      "successful" : 5,
      "total" : 5
   },
   "took" : 2
}
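
One way to read the two results above is with a toy model in Python. It is only a sketch of the observed behaviour, not how ES is implemented: it assumes the array values are tokenized on whitespace and given consecutive positions with no gap between values, and it ignores stop words. The function names are made up for the example.

```python
def index_positions(values):
    """Tokenize each array value and append to one running token list."""
    tokens = []
    for value in values:
        tokens.extend(value.lower().split())
    return tokens


def phrase_matches(values, phrase):
    """True if the phrase words appear at adjacent positions, in order."""
    tokens = index_positions(values)
    wanted = phrase.lower().split()
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


doc = ["The quick brown fox", "jumped over the lazy dog"]
print(phrase_matches(doc, "brown jumped"))
print(phrase_matches(doc, "fox jumped"))
```

Under those assumptions "fox" and "jumped" end up next to each other even though they come from different array values, which agrees with the one hit above, while "brown jumped" finds nothing.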

