I'm curious about some practical tips to improve search result relevance.
Currently, I'm tokenizing my fields with shingles and performing a simple
"text" search on the shingled field. I've found this gives better results
than other things I've tried (combinations of: terms, n-grams, phrase,
shingles). However, search results leave something to be desired. I
imagine there are ways to fix this...I just don't know how.
For example, if I search for "Servo Gear", it will match all documents with
either "Servo" or "Gear" and order them roughly based on frequency. There
is some preference to documents that say "Servo Gear" explicitly, but often
a document that lists "Gear" four times will rank higher simply because it
has the term more frequently. Ideally, something that matches the phrase
would rank higher.
So, how should I attack this problem? I'm thinking something like this:
* Analyzers
  * Regular term tokenizer
  * Shingles, but turn off unigrams
* Search both terms and shingles, but boost shingles so that phrase
  matches are sorted higher
* Perhaps search using span_near so that non-exact phrases can be
  matched too? Would it be better to do something like a phrase query with
  slop instead?
Does that make sense? I understand ES well enough from a technical point
of view, but I'm having a hard time implementing more subtle search
algorithms that can surface the correct documents.
Thanks!
-Zach
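A minimal sketch of the analyzer half of that plan, written as a Python dictionary that could be sent as an index-creation body. The field name `title`, the sub-field name `shingles`, and the analyzer and filter names are all hypothetical. The `shingle` token filter and its `output_unigrams`, `min_shingle_size` and `max_shingle_size` options are standard Elasticsearch settings, but the multi-field mapping syntax assumes a recent Elasticsearch version rather than the one in use when this thread was written.

```python
import json

# Hypothetical index body: a plain analyzed field plus a shingle-only sub-field.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Emit two-word shingles only; unigrams are switched off.
                "shingles_only": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingles_only"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",  # regular term tokenization
                "fields": {
                    "shingles": {"type": "text", "analyzer": "shingle_analyzer"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Keeping the shingles in their own sub-field means the plain terms and the shingles can be queried and boosted separately.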
Hi, all. I'm having a slightly different problem, but I guess in essence
it's the same.
I have an index of items and I'm trying to search by title for 'iphone 5'.
I get well-sorted results: 'iphone 5' first, then all the others such as 'iphone 3g',
'iphone 4s', etc.
Now my problem is that 'Loreal Elseve 5' also shows up in the search results,
i.e. Elasticsearch includes every entry containing the number 5 (and the
score is pretty high). How could I solve this?
I don't want to filter out all numbers at indexing time, because they're very
useful when I search for a keyword followed by a number or
version.
On Wednesday, November 28, 2012 9:51:56 AM UTC+6, Zachary Tong wrote:
[original post quoted in full; see the top of the thread]
On Sun, 2013-01-27 at 20:17 -0800, Rauan Maemirov wrote:
Hi, all. I'm having a little bit different problem, but I guess in
essence it's the same.
I have an index with items and trying to search by title 'iphone 5'.
I can get well sorted items 'iphone 5' and then all other 'iphone 3g',
'iphone 4s', etc.
Now my problem is that there's also 'Loreal Elseve 5' in search
results, i.e. elastic including in search results all entries with
number 5 (and the score is pretty high). How could I solve it?
You could try setting minimum_should_match to e.g. "60%".
clint
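As a rough illustration of that suggestion, here is a sketch that builds a `match` query with `minimum_should_match` as a Python dictionary. The helper name `min_match_query` and the field name `title` are hypothetical, and the `match` query syntax assumes a reasonably recent Elasticsearch version.

```python
import json


def min_match_query(field: str, text: str, minimum_should_match: str = "60%") -> dict:
    """Build a match query that requires a share of the query terms to match."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "minimum_should_match": minimum_should_match,
                }
            }
        }
    }


print(json.dumps(min_match_query("title", "iphone 5"), indent=2))
```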
Specifically regarding exact phrases getting ranked higher, I like using a
phrase-boost technique with a term-based analyzer. It breaks down like this:
(field:"test search")^PhraseBoostValue OR field:(test search)
Best Regards,
Paul
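A small sketch of how that query string could be assembled in Python. The helper name `phrase_boost` and the default boost of 5 are hypothetical, and the input is assumed to contain no Lucene special characters (quotes, colons, parentheses and so on), since nothing is escaped here.

```python
def phrase_boost(field: str, text: str, boost: float = 5.0) -> str:
    """Return a Lucene query string: boosted exact phrase OR the loose terms."""
    terms = " ".join(text.split())  # collapse repeated whitespace
    return f'({field}:"{terms}")^{boost:g} OR {field}:({terms})'


print(phrase_boost("field", "test search"))
# (field:"test search")^5 OR field:(test search)
```

Documents matching only the individual terms still match through the second clause; documents containing the exact phrase also pick up the boosted first clause and so tend to score higher.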
On Monday, January 28, 2013 8:58:27 AM UTC-7, Clinton Gormley wrote:
[earlier messages quoted in full; see above]
Hi, Clinton.
I tried, but I still keep getting every occurrence of 5.
Any other suggestions? I already use query_string field boosting like "fields":
["title^2", "tags^2", "description"]
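One possible explanation, assuming the percentage form of `minimum_should_match` is rounded down as the Elasticsearch documentation describes: for a two-term query such as 'iphone 5', 60% of 2 is 1.2, which rounds down to one required term, so a document containing only '5' would still match. The helper below is a hypothetical illustration of that arithmetic for positive percentages, not Elasticsearch code.

```python
def required_terms(term_count: int, percent: int) -> int:
    """Number of optional terms that must match, rounding the percentage down."""
    return term_count * percent // 100


print(required_terms(2, 60))   # 1 -> a lone "5" is enough to match
print(required_terms(2, 100))  # 2 -> both "iphone" and "5" are required
print(required_terms(5, 60))   # 3
```

Under that assumption, a short query needs a higher percentage (or a requirement that all terms match) before the setting excludes single-term matches.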
On Wednesday, November 28, 2012 4:51:56 AM UTC+1, Zachary Tong wrote:
[original post quoted in full; see the top of the thread]
Shingles are a good start here. I would personally index the shingles in a
dedicated field without unigrams and have a secondary field that doesn't
use shingles. That way you can boost the shingle field according to your
needs. I would also think about using a
DisjunctionMaxQuery as the top-level query, and for each sub-query (one on
the shingle field and one on the unigram field) use the
minimum_should_match syntax to denote when the query should produce a match.
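A minimal sketch of that layout as an Elasticsearch `dis_max` query built in Python. The field names `title` and `title.shingles`, the boost of 3, and the "75%" threshold are all hypothetical values chosen for illustration, and the query syntax assumes a recent Elasticsearch version.

```python
import json


def dis_max_query(text: str, shingle_boost: float = 3.0, min_match: str = "75%") -> dict:
    """Score each document by its best-matching sub-query (shingles or plain terms)."""
    return {
        "query": {
            "dis_max": {
                "queries": [
                    {   # shingle field: boosted so phrase-like matches win
                        "match": {
                            "title.shingles": {
                                "query": text,
                                "boost": shingle_boost,
                                "minimum_should_match": min_match,
                            }
                        }
                    },
                    {   # plain unigram field: fallback for looser matches
                        "match": {
                            "title": {
                                "query": text,
                                "minimum_should_match": min_match,
                            }
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(dis_max_query("servo gear"), indent=2))
```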
We implemented this sort of phrase-boosting for a client recently by shingling the query string outside elasticsearch and then adding the shingles as phrase queries in SHOULD clauses. So a search for 'annual leave entitlement' became:
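The query from that project is not reproduced here. Purely as an illustration of the general idea, and not the poster's actual query, the sketch below shingles a query string in Python and adds each shingle as a `match_phrase` clause under `should`. The helper names, the field name `body`, the shingle sizes of 2 to 3, and the base `match` clause under `must` are all assumptions.

```python
import json


def shingles(text: str, min_size: int = 2, max_size: int = 3) -> list:
    """Return all word n-grams of the text, from min_size to max_size words long."""
    words = text.split()
    out = []
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            out.append(" ".join(words[start:start + size]))
    return out


def shingled_phrase_query(field: str, text: str) -> dict:
    """Match on the plain terms; each matching phrase shingle adds to the score."""
    should = [{"match_phrase": {field: s}} for s in shingles(text)]
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": should,
            }
        }
    }


print(shingles("annual leave entitlement"))
# ['annual leave', 'leave entitlement', 'annual leave entitlement']
print(json.dumps(shingled_phrase_query("body", "annual leave entitlement"), indent=2))
```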
My question is still open.
What is the most general solution to this?
I tried querying with use_dis_max, but it doesn't change much.
I would try setting a threshold on the score, but every single occurrence of the number
5 in the index has roughly the same score as the most relevant results.
On Tuesday, January 29, 2013 10:19:33 AM UTC+6, Rauan Maemirov wrote:
[earlier messages quoted in full; see above]