[Theory] Improving search result relevance?

I'm curious about some practical tips to improve search result relevance.
Currently, I'm tokenizing my fields with shingles and performing a simple
"text" search on the shingled field. I've found this gives better results
than other things I've tried (combinations of: terms, n-grams, phrase,
shingles). However, search results leave something to be desired. I
imagine there are ways to fix this...I just don't know how.

For example, if I search for "Servo Gear", it will match all documents with
either "Servo" or "Gear" and order them roughly based on frequency. There
is some preference to documents that say "Servo Gear" explicitly, but often
a document that lists "Gear" four times will rank higher simply because it
has the term more frequently. Ideally, something that matches the phrase
would rank higher.

So, how should I attack this problem? I'm thinking something like this:

  • Analyzers
    • Regular term tokenizer
    • Shingles, but turn off unigrams
  • Search both terms and shingles, but boost shingles so that phrase
    matches are sorted higher
  • Perhaps search using span_near so that non-exact phrases can be
    matched too? Would it be better to do something like a phrase query with
    slop instead?
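
The plan in the list above could be sketched roughly as follows. This is only an illustration in Python that builds plain request bodies (no client library is used). The field name `title`, the sub-field `title.shingles`, the filter and analyzer names, and the boost value are all assumptions, and the mapping uses the multi-field syntax of recent Elasticsearch versions rather than whatever was current when this was posted.

```python
import json

# Index body: a shingle filter that emits only word pairs (no unigrams),
# applied to a sub-field so the plain terms stay searchable on their own.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "pair_shingles": {  # hypothetical filter name
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "shingle_analyzer": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "pair_shingles"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "shingles": {"type": "text", "analyzer": "shingle_analyzer"}
                },
            }
        }
    },
}


def terms_plus_shingles(text, shingle_boost=3.0, slop=0):
    """Match on plain terms; shingle hits (and an optional sloppy phrase) add score."""
    should = [{"match": {"title.shingles": {"query": text, "boost": shingle_boost}}}]
    if slop:
        # A phrase query with slop tolerates words that are near each other
        # but not strictly adjacent.
        should.append({"match_phrase": {"title": {"query": text, "slop": slop}}})
    return {"query": {"bool": {"must": {"match": {"title": text}}, "should": should}}}


if __name__ == "__main__":
    print(json.dumps(terms_plus_shingles("Servo Gear", slop=2), indent=2))
```

Because the shingle and phrase clauses sit under `should`, they never exclude a document; they only lift documents that contain the words together.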

Does that make sense? I understand ES well enough from a technical point
of view, but I'm having a hard time implementing more subtle search
algorithms that can surface the correct documents.

Thanks!
-Zach

--

Hi, all. I'm having a slightly different problem, but I guess in essence
it's the same.

I have an index of items and am trying to search by title for 'iphone 5'.
I get well-sorted results: 'iphone 5' first, then all the others: 'iphone 3g',
'iphone 4s', etc.

Now my problem is that there's also 'Loreal Elseve 5' in the search results,
i.e. Elasticsearch includes every entry containing the number 5 (and the
score is pretty high). How can I solve this?

I don't want to filter out all numbers at indexing time, because they're very
useful when I search for a keyword followed by a number or version.

On Wednesday, November 28, 2012 9:51:56 AM UTC+6, Zachary Tong wrote:

I'm curious about some practical tips to improve search result relevance.

--

On Sun, 2013-01-27 at 20:17 -0800, Rauan Maemirov wrote:

Now my problem is that there's also 'Loreal Elseve 5' in search
results, i.e. elastic including in search results all entries with
number 5 (and the score is pretty high). How could I solve it?

You could try setting minimum_should_match to e.g. "60%".
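
A rough illustration of what such a percentage means in practice, assuming the documented rule that the computed number of required clauses is rounded down. The helper names and the `title` field are made up for this sketch; only `match` and `minimum_should_match` are Elasticsearch names.

```python
def required_terms(num_terms, percent):
    """Approximate how many optional terms a positive percentage demands.

    Assumes the documented rule that the computed value is rounded down.
    """
    return (num_terms * percent) // 100


def match_with_minimum(field, text, minimum_should_match):
    """Build a match query body that carries a minimum_should_match setting."""
    return {
        "query": {
            "match": {
                field: {"query": text, "minimum_should_match": minimum_should_match}
            }
        }
    }


if __name__ == "__main__":
    # 'iphone 5' is assumed to analyze to two terms.
    for pct in (60, 100):
        print(pct, "% of 2 terms ->", required_terms(2, pct), "required")
    print(match_with_minimum("title", "iphone 5", "100%"))
```

Under that rounding rule, "60%" of a two-term query still requires only one term, so a document containing just '5' would continue to match; a value that works out to two required terms is needed to exclude it.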

clint

--

Specifically regarding exact phrases getting ranked higher, I like to use a
phrase-boost technique together with a term-based analyzer. It breaks down like this:
(field:"test search")^PhraseBoostValue OR field:(test search)
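
As a minimal sketch, the pattern above can be generated from user input like this. The function names and the default boost of 5 are assumptions for illustration, and the input is assumed to be plain words (query-syntax characters are not escaped).

```python
def phrase_boost_query_string(field, text, phrase_boost=5):
    """Return '(field:"text")^boost OR field:(text)' for plain-word input."""
    # The quoted part matches the exact phrase and is boosted; the
    # parenthesised part matches the individual terms as a fallback.
    return '({f}:"{t}")^{b} OR {f}:({t})'.format(f=field, t=text, b=phrase_boost)


def as_search_body(field, text, phrase_boost=5):
    """Wrap the generated string in a query_string request body."""
    return {
        "query": {
            "query_string": {
                "query": phrase_boost_query_string(field, text, phrase_boost)
            }
        }
    }


if __name__ == "__main__":
    print(phrase_boost_query_string("title", "servo gear"))
```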

Best Regards,
Paul

On Monday, January 28, 2013 8:58:27 AM UTC-7, Clinton Gormley wrote:

You could try setting minimum_should_match to e.g. "60%".

--

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Clinton.

I tried, but I still keep getting every occurrence of 5.

Any other suggestions? I already use query_string field boosting, like "fields":
["title^2", "tags^2", "description"]
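
For reference, the field boosting described here might look like the request body built below. The `default_operator` value of "AND" is not something reported in this thread; it is shown only as one option to experiment with, so that every term has to match somewhere. The function name is made up for this sketch.

```python
def boosted_query_string(text, require_all_terms=False):
    """Build a query_string search over title, tags and description with boosts."""
    query = {
        "query": text,
        "fields": ["title^2", "tags^2", "description"],
    }
    if require_all_terms:
        # Assumption for experimentation: require every term rather than any term.
        query["default_operator"] = "AND"
    return {"query": {"query_string": query}}


if __name__ == "__main__":
    print(boosted_query_string("iphone 5", require_all_terms=True))
```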

--

Hey,

Shingles are a good start here. I would personally index the shingles in a
dedicated field without unigrams and have a secondary field that doesn't
use shingles. That way you can boost the shingle field according to your
needs. I would also think about using a DisjunctionMaxQuery as the top-level
query, and for each sub-query (one on the shingle field and one on the
unigram field) use the minimum_should_match syntax to denote when the
query should produce a match.
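
A minimal sketch of that structure, written as a Python function that builds the request body. In the query DSL a DisjunctionMaxQuery is expressed as `dis_max`. The field names `title` and `title.shingles`, the boost, the "60%" threshold, and the tie breaker are assumptions chosen for illustration.

```python
def dis_max_over_fields(text, plain_field="title", shingle_field="title.shingles",
                        shingle_boost=3.0, minimum_should_match="60%",
                        tie_breaker=0.0):
    """Score each document by its best-matching sub-query (shingles or terms)."""
    return {
        "query": {
            "dis_max": {
                # With tie_breaker 0.0 only the best sub-query counts;
                # a larger value also credits the other one.
                "tie_breaker": tie_breaker,
                "queries": [
                    {"match": {shingle_field: {
                        "query": text,
                        "boost": shingle_boost,
                        "minimum_should_match": minimum_should_match,
                    }}},
                    {"match": {plain_field: {
                        "query": text,
                        "minimum_should_match": minimum_should_match,
                    }}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(dis_max_over_fields("servo gear"))
```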

simon

--

We implemented this sort of phrase-boosting for a client recently by shingling the query string outside elasticsearch and then adding the shingles as phrase queries in SHOULD clauses. So a search for 'annual leave entitlement' became:

"bool" : {
  "must" : { "query_string" : { "query" : "annual leave entitlement" } },
  "should" : [ { "text" : { "type" : "phrase", "query" : "annual leave" } },
               { "text" : { "type" : "phrase", "query" : "leave entitlement" } } ] }
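
The client-side shingling step could look something like the sketch below. It is not the code referred to above, just one way to produce the same shape of query. The field name `title` and the boost are assumptions, and `match_phrase` is used as the current spelling of a phrase-type text query.

```python
def shingle(text, size=2):
    """Return the word n-grams (shingles) of `text`, in order."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def phrase_boost_query(text, field="title", boost=2.0):
    """Build a bool query: the full text must match, shingle phrases add score."""
    return {
        "bool": {
            "must": {"query_string": {"default_field": field, "query": text}},
            "should": [
                {"match_phrase": {field: {"query": s, "boost": boost}}}
                for s in shingle(text)
            ],
        }
    }


if __name__ == "__main__":
    print(shingle("annual leave entitlement"))
    print(phrase_boost_query("annual leave entitlement"))
```

A one-word query yields no shingles, so the `should` list is simply empty and only the `must` clause applies.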

Alan Woodward
www.flax.co.uk

On 29 Jan 2013, at 07:17, simonw wrote:

Shingles are a good start here. I would personally index the shingles in a dedicated field without unigrams and have a secondary field that doesn't use shingles. That way you can boost the shingle field according to your needs. I would also think about using a DisjunctionMaxQuery as the top-level query, and for each sub-query (one on the shingle field and one on the unigram field) use the minimum_should_match syntax to denote when the query should produce a match.

simon

--

My question is still open.
What is the most general solution to this?

I tried querying with use_dis_max, but it doesn't change much.
I would try to set a threshold on the score, but every single occurrence of the
number 5 in the index has a score roughly the same as the most relevant results.

On Tuesday, January 29, 2013 10:19:33 AM UTC+6, Rauan Maemirov wrote:

Hi, Clinton.

I tried, but I still keep getting every occurrence of 5.
