We implemented this sort of phrase-boosting for a client recently by shingling the query string outside elasticsearch and then adding the shingles as phrase queries in SHOULD clauses. So a search for 'annual leave entitlement' became:
"bool" : { "must" : { "query_string" : { "query" : "annual leave entitlement" } },
"should" : [ { "text" : { "type" : "phrase", "query" : "annual leave" } },
{ "text" : { "type" : "phrase", "query" : "leave entitlement" } } ] }
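The client-side shingling step could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the client: the field name "body" and both function names are hypothetical, and match_phrase is used here as the later name for a "text" query with type "phrase".

```python
def shingles(query, size=2):
    """Return the word n-grams (shingles) of the given size in a query string."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def phrase_boost_query(query, field="body"):
    """Build a bool query: the full query must match, and each shingle
    is added as an optional phrase clause that raises the score.

    `field` is a hypothetical field name; replace it with your own.
    """
    return {
        "bool": {
            "must": {"query_string": {"query": query, "default_field": field}},
            "should": [
                {"match_phrase": {field: phrase}} for phrase in shingles(query)
            ],
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(phrase_boost_query("annual leave entitlement"), indent=2))
```

A one-word query produces no shingles, so the should list is simply empty and only the must clause applies.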
Alan Woodward
www.flax.co.uk
On 29 Jan 2013, at 07:17, simonw wrote:
Hey,
On Wednesday, November 28, 2012 4:51:56 AM UTC+1, Zachary Tong wrote:
I'm curious about some practical tips to improve search result relevance. Currently, I'm tokenizing my fields with shingles and performing a simple "text" search on the shingled field. I've found this gives better results than other things I've tried (combinations of: terms, n-grams, phrase, shingles). However, search results leave something to be desired. I imagine there are ways to fix this...I just don't know how.
For example, if I search for "Servo Gear", it will match all documents with either "Servo" or "Gear" and order them roughly based on frequency. There is some preference to documents that say "Servo Gear" explicitly, but often a document that lists "Gear" four times will rank higher simply because it has the term more frequently. Ideally, something that matches the phrase would rank higher.
So, how should I attack this problem? I'm thinking something like this:
- Analyzers
  - Regular term tokenizer
  - Shingles, but turn off unigrams
- Search both terms and shingles, but boost shingles so that phrase matches are sorted higher
- Perhaps search using span_near so that non-exact phrases can be matched too? Would it be better to do something like a phrase query with slop instead?
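For comparison, the two "near match" options mentioned in the last item might look like this when built as query bodies. This is only a sketch: the field name "title", the slop value and the helper names are assumptions. One practical difference is that a phrase query analyzes the text for you, whereas span_term clauses are not analyzed, so the terms must already be in their indexed form (for example lowercased).

```python
def phrase_with_slop(field, text, slop=2):
    """Phrase query that tolerates up to `slop` position moves between terms."""
    return {"match_phrase": {field: {"query": text, "slop": slop}}}


def span_near(field, terms, slop=2, in_order=True):
    """span_near over span_term clauses; `terms` must already be analyzed."""
    return {
        "span_near": {
            "clauses": [{"span_term": {field: term}} for term in terms],
            "slop": slop,
            "in_order": in_order,
        }
    }


if __name__ == "__main__":
    # "title" is a hypothetical field name.
    print(phrase_with_slop("title", "Servo Gear"))
    print(span_near("title", ["servo", "gear"]))
```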
Does that make sense? I understand ES well enough from a technical point of view, but I'm having a hard time implementing more subtle search algorithms that can surface the correct documents.
Shingles are a good start here. I would personally index the shingles in a dedicated field without unigrams and have a secondary field that doesn't use shingles. That way you can boost the shingle field according to your needs. I would also think about using a DisjunctionMaxQuery as the top-level query, and for each sub-query (one on the shingle field and one on the unigram field) use the minimum_should_match syntax to denote when the query should produce a match.
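A rough sketch of that setup, with several assumptions: the filter and analyzer names, the field names "title" and "title.shingles", the boost, the tie_breaker and the minimum_should_match values are all made up for illustration. The field mapping is left out because its syntax differs between Elasticsearch versions; the sketch assumes the shingle field is analyzed with shingle_analyzer.

```python
# Analysis settings: a shingle filter that emits only two-word shingles.
INDEX_SETTINGS = {
    "analysis": {
        "filter": {
            "shingle_no_unigrams": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": False,  # keep single terms out of this field
            }
        },
        "analyzer": {
            "shingle_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "shingle_no_unigrams"],
            }
        },
    }
}


def dis_max_query(text, plain_field="title", shingle_field="title.shingles",
                  shingle_boost=3.0, min_match="75%"):
    """dis_max over a plain (unigram) field and a boosted shingle field.

    All field names and numeric values are hypothetical defaults.
    """
    return {
        "dis_max": {
            "tie_breaker": 0.3,
            "queries": [
                {"match": {plain_field: {
                    "query": text,
                    "minimum_should_match": min_match,
                }}},
                {"match": {shingle_field: {
                    "query": text,
                    "boost": shingle_boost,
                    "minimum_should_match": min_match,
                }}},
            ],
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(dis_max_query("Servo Gear"), indent=2))
```

With dis_max, a document scores mainly by its best-matching sub-query, so a boosted shingle match can outrank a document that merely repeats one of the terms; the tie_breaker controls how much the weaker sub-query still contributes.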
simon
Thanks!
-Zach
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.