Fine-tuning search


(Clinton Gormley) #1

Hiya

I've indexed my docs with two 'keyword' fields:

  • name (with index boost of 1.3)
  • text

Also, the 'all' field is enabled.

Some of the docs have the name field filled in (in which case, this is
the most important field) and others not, in which case we just have the
keywords in the text field.

Here are four example results for a search on 'john smith', in
descending order of relevance:



  1. Name: John Smith
    Text: John Smith passed away peacefully on March 20, aged 82.
    Funeral service will be held on Tuesday, April 3 in ....

  2. Name: John Smith
    Text: Passed away peacefully on March 20, aged 82. Funeral service
    will be held on Tuesday, April 3 in ....
    

  3. Name: ''
    Text: John Smith passed away peacefully on March 20, aged 82.
    Funeral service will be held on Tuesday, April 3 in ....

  4. Name: Maggie Smith
    Text: Maggie Smith passed away peacefully on March 20, aged 82.
    Sadly missed by husband John

A naive search for 'john smith' on the 'all' field favours doc (4) over
doc (3).

I'm trying to apply this logic, in descending order of importance:

  • all the words close together in the name field
  • all the words close together in the text field, if the doc
    doesn't have a name field
  • as many words as possible in the 'all' field
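The three rules above can be read as a lexicographic sort key. The following is only a toy sketch in Python of that intended ordering, not an Elasticsearch or Lucene query: the helper names (tokens, close_together, rank_key), the slop window of 4, and the choice to count word occurrences for the third rule are all assumptions made for illustration, and the doc texts are shortened from the examples above.

```python
import re
from itertools import product

def tokens(s):
    """Lowercase word tokens, punctuation dropped."""
    return re.findall(r"\w+", s.lower())

def close_together(field, words, slop=4):
    """True if every query word occurs in `field` inside one window that
    allows at most `slop` extra positions between first and last word."""
    toks = tokens(field)
    positions = [[i for i, t in enumerate(toks) if t == w] for w in words]
    if not all(positions):          # some query word is missing entirely
        return False
    max_span = len(words) - 1 + slop
    return any(max(c) - min(c) <= max_span for c in product(*positions))

def rank_key(doc, query):
    """Tuple compared left to right: rule 1, then rule 2, then rule 3."""
    words = tokens(query)
    has_name = bool(doc["name"].strip())
    everything = tokens(doc["name"] + " " + doc["text"])
    return (
        close_together(doc["name"], words),                     # rule 1
        (not has_name) and close_together(doc["text"], words),  # rule 2
        sum(everything.count(w) for w in words),                # rule 3 (assumed: occurrence count)
    )

# Deliberately listed in the wrong order to show that the key sorts them.
docs = [
    {"id": 4, "name": "Maggie Smith",
     "text": "Maggie Smith passed away peacefully on March 20, aged 82. "
             "Sadly missed by husband John"},
    {"id": 3, "name": "",
     "text": "John Smith passed away peacefully on March 20, aged 82."},
    {"id": 2, "name": "John Smith",
     "text": "Passed away peacefully on March 20, aged 82."},
    {"id": 1, "name": "John Smith",
     "text": "John Smith passed away peacefully on March 20, aged 82."},
]

ranked = sorted(docs, key=lambda d: rank_key(d, "john smith"), reverse=True)
print([d["id"] for d in ranked])  # [1, 2, 3, 4]
```

Note that rule 3 on its own gives doc (4) three hits and doc (3) only two, which mirrors the ordering problem with the naive search; it is the earlier rules that put doc (3) first.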

Does this query achieve that? Any way of improving it?

curl -XGET 'http://127.0.0.0:9200/ia_object/notice/_search?searchType=dfs_query_then_fetch' -d '
{
  "sort" : [
    "score"
  ],
  "fields" : [],
  "query" : {
    "filteredQuery" : {
      "filter" : {
        "bool" : {
          "must" : [
            {
              "term" : {
                "status" : "active"
              }
            },
            {
              "term" : {
                "location_id" : "23"
              }
            }
          ]
        }
      },
      "query" : {
        "disMax" : {
          "tieBreaker" : "0.7",
          "queries" : [
            {
              "queryString" : {
                "fields" : [
                  "name"
                ],
                "boost" : "1.3",
                "query" : "\"john smith\"~4"
              }
            },
            {
              "filteredQuery" : {
                "filter" : {
                  "term" : {
                    "has_name" : "0"
                  }
                },
                "query" : {
                  "queryString" : {
                    "fields" : [
                      "text"
                    ],
                    "boost" : "1.5",
                    "query" : "\"john smith\"~4"
                  }
                }
              }
            },
            {
              "queryString" : {
                "boost" : 1,
                "query" : "john smith"
              }
            }
          ]
        }
      }
    }
  },
  "from" : 0,
  "size" : "100"
}
'

thanks

Clint

--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.


(Clinton Gormley) #2

I've implemented the query mentioned in my previous email, but it could
do with some improvement.

For example, when I search for 'Edward Rowe', I get these two results
first, which looks correct:

    Alfred Edward Rowe South Shields: Obituary 
    
    Alfred Edward Rowe South Shields. Passed away on August 19.
    2009, aged 75. For our kind, gentle and always loving father.
    From your girls Debra, Janet and Carole. Respected father in law
    of Frank,...
    
    
    Obituary 
    
    ROWE Albert Edward Eveleigh Passed away suddenly on February 7,
    2007. Funeral service at Portchester Crematorium on Wednesday,
    February 28, 2007, at 1.00 p.m. No flowers please. Donations,
    if...

However, if I search for 'Edward Rowe Crematorium' (for which the second
doc is more relevant), then the second doc is bumped down the list (i.e.
it is ranked as less relevant).

Is this just a question of getting the boosts right, or are my queries
not achieving what I hope?

You can see the live search here (at least, for now):
http://es.iannounce.co.uk/?keywords=edward+rowe+crematorium&sub_type=&date_limit=&region=23&source=0&_fstatus=search

thanks

Clint


(Shay Banon) #3

Hi,

First, two things. If you get the latest master, filteredQuery has been
renamed to filtered. Also, there is a field query, which is the same as
queryString with a single field (it should make things a bit more readable).

I would actually play more with the tieBreaker: try giving it lower
values and see if that helps. In general, that's the kind of tweaking you do
with Lucene to try to nail the perfect matching. I would also play with
adding another query, a phrase query, with a slop of 3 or 4.
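To see why a lower tieBreaker can matter, here is a small Python sketch of how Lucene's DisjunctionMaxQuery combines clause scores: the best clause score, plus the tie breaker times the sum of the remaining clause scores. The function name and the per-clause scores are made up for illustration; they are not taken from the real index.

```python
def dis_max_score(clause_scores, tie_breaker):
    """Best clause score plus tie_breaker times the sum of the other clauses."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)

# Hypothetical per-clause scores (name phrase, text phrase, 'all' field).
strong_single = [3.0, 0.0, 0.0]   # one strong match, e.g. phrase in the name
weak_spread = [2.0, 1.0, 1.0]     # weaker matches spread over several clauses

for tb in (0.7, 0.1):
    a = dis_max_score(strong_single, tb)
    b = dis_max_score(weak_spread, tb)
    print(f"tieBreaker={tb}: single-clause doc {a:.2f}, spread doc {b:.2f}")
```

With these made-up numbers, a tieBreaker of 0.7 lets the doc with several weaker matches overtake the doc with one strong match, while 0.1 keeps the strong single match on top.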

-shay.banon

On Wed, Mar 24, 2010 at 8:00 PM, Clinton Gormley clinton@iannounce.co.uk wrote:

However, if I search for 'Edward Rowe Crematorium' (for which the second
doc is more relevant), then the second doc is bumped down the list (i.e.
it is ranked as less relevant).

Is this just a question of getting the boosts right, or are my queries
not achieving what I hope?


(egaumer) #4

Essentially what you want is to boost documents where these words appear
closer together. At the same time you want matches in the name field to be
boosted above matches in the text field.

I would think something along the lines of:

"queryString" : {
    "fields" : ["name^5", "text"],
    "query" : "edward rowe crematorium",
    "phraseSlop" : "15",
    "useDisMax" : true
}

Would suffice.

I think your slop factor is too low (keep in mind this represents the max
allowable distance). Bump it up a bit because the closer your terms, the
higher the score (regardless of the max slop factor). In the link you
provided, 4 is too low to generate a valid match.
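As a rough illustration of "the closer your terms, the higher the score": Lucene's classic default similarity weights a sloppy phrase match by 1 / (distance + 1). The Python sketch below is a simplification that assumes a two-term phrase with the terms in query order, where the distance is just the number of tokens between them; the function name and the positions used are hypothetical.

```python
def sloppy_phrase_weight(first_pos, second_pos, slop):
    """Weight of a two-term, in-order phrase match; 0.0 if outside the slop.

    Simplified: distance is the number of tokens between the two terms.
    """
    distance = second_pos - first_pos - 1
    if distance < 0 or distance > slop:
        return 0.0                    # out of order or too far apart: no match
    return 1.0 / (distance + 1)

# Adjacent terms, a small gap, and a large gap (positions are made up).
for first, second in ((0, 1), (0, 4), (0, 11)):
    for slop in (4, 15):
        w = sloppy_phrase_weight(first, second, slop)
        print(f"gap={second - first - 1:2d} slop={slop:2d} weight={w:.3f}")
```

A larger slop only widens what is allowed to match; terms that sit close together keep the same high weight, which is why raising the maximum does not hurt the tight matches.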

The example above should give higher scores to documents where all these
terms appear closer and at the same time, boost documents that have matches
in the name field.

I haven't studied the elasticsearch DSL in depth but this is the general
logic you'd use to tune relevancy regardless of the query language you're
using.

Start with weighting based on proximity (i.e. the closer the terms the
higher the score), then boost specific fields that are more relevant
(referred to as "context" weighting).

With Lucene, proximity is achieved via sloppy phrases and context weight is
achieved via DisjunctionMaxQuery.

Regards,
-Eric


(Shay Banon) #5

I suggested a low value for slop because I was suggesting matching it
against the name field (where long names are unlikely). Note that a
queryString of "something else something" is not translated in Lucene into
a phrase query but into a boolean OR/AND query (depending on the
defaultOperator). To get a phrase, the query text itself has to be wrapped
in double quotes, i.e. "\"something else something\"", but then you lose
options of the query parser, and you are probably better off using the
phrase query directly if you want phrase queries. That said, your idea is a
good guideline for what to try to achieve.
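One practical detail when sending a quoted phrase inside a JSON body is that the inner double quotes must be escaped. A minimal Python sketch, assuming the request body is built as a dict and serialized with the standard json module (the queryString, fields and query keys are the ones used earlier in this thread; nothing is sent to a server here):

```python
import json

# The query text itself contains double quotes plus the ~4 slop suffix.
phrase = '"john smith"~4'

body = {"queryString": {"fields": ["name"], "query": phrase}}

# json.dumps escapes the inner quotes, producing valid JSON.
encoded = json.dumps(body)
print(encoded)
# {"queryString": {"fields": ["name"], "query": "\"john smith\"~4"}}
```

Writing the body by hand works too, as long as the inner quotes are written as \" rather than as bare double quotes.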

-shay.banon

On Wed, Mar 24, 2010 at 9:12 PM, Eric Gaumer egaumer@gmail.com wrote:

I think your slop factor is too low (keep in mind this represents the max
allowable distance). Bump it up a bit because the closer your terms, the
higher the score (regardless of the max slop factor). In the link you
provided, 4 is too low to generate a valid match.

