We have an index with many quotes (sentences in English), and I want to allow users to find a quote by auto completion.
I have built the index just as described here: a standard tokenizer followed by an edge n-gram token filter. In addition, the search analyzer is standard, so that searches match exact phrases and not only separate words.
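A setup of this kind might look roughly like the following sketch. It is written as a Python dict so the assumptions can be commented; the analyzer and filter names, the gram sizes and the Elasticsearch 7+ typeless mapping syntax are all assumptions, not taken from the actual index.

```python
import json


def autocomplete_index_body(min_gram=1, max_gram=20):
    """Build a create-index body: standard tokenizer + edge n-gram token filter.

    All names and gram sizes here are illustrative assumptions.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; emits the prefixes of every token.
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name, used at index time only.
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "Sentence": {
                    "type": "text",
                    "analyzer": "autocomplete",
                    # Query text is not n-grammed, so phrases stay intact.
                    "search_analyzer": "standard",
                }
            }
        },
    }


# The dict serializes to the JSON body of a create-index request.
index_body_json = json.dumps(autocomplete_index_body(), indent=2)
```
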
POST quotes/_search
{
  "query": {
    "match_phrase": {
      "Sentence": {
        "query": "business"
      }
    }
  }
}
The auto completion is pretty good:
Matches at any position: beginning, middle or end of the sentence.
Matches partial words: it keeps only documents that contain all of the searched terms, in the same positions relative to each other. It also ignores documents that don't match the phrase (irrelevant words / characters inside the phrase).
It returns a single document when the full quote is searched.
The performance of query time is excellent.
The only problem, as far as we have noticed, is that documents in which the searched phrase appears earlier are not given precedence. This is related to the score calculated by the match_phrase query. A similar problem is described here.
For example, let's say the phrase searched is "business".
All of the following sentences will come back with nearly the same score:
business is all over the city as you know (1st word)
I love my business (4th word)
I would expect the results to come back in the order listed above, but they don't. The sentence that starts with the word "business" (the 1st sentence) does not reliably come back first. Instead, the sentences above often share the same score.
Is there a way to improve this? The tokenizer knows that "business" is the first word of the sentence, and it is also the first word of the analyzed search text. So why isn't the first sentence preferred over the others?
Maybe using script_score would help, as mentioned here? Would completion suggesters do a better job?
The matching position is not part of the default scoring, so you'll need to add some custom logic to handle your case. For instance, you could use two fields: one that uses edge n-grams with a keyword tokenizer, which would match by prefix only (only if the field starts with the query), and another field like the one you described in your post. You could then combine these two fields in a boolean query.
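A minimal sketch of such a boolean query, written as a Python dict. It assumes a hypothetical sub-field named `Sentence.prefix` that is indexed with a keyword tokenizer plus edge n-grams and searched without n-grams; the sub-field name and the boost value are illustrative, not something from the original index.

```python
def prefix_boosted_query(text, prefix_boost=2.0):
    """Phrase match on the main field, extra score when the field starts with `text`."""
    return {
        "query": {
            "bool": {
                # Every hit must still match the phrase, as in the original query.
                "must": [
                    {"match_phrase": {"Sentence": {"query": text}}}
                ],
                # Optional clause: only sentences that begin with the query
                # match the hypothetical prefix-only sub-field, so they score higher.
                "should": [
                    {
                        "match": {
                            "Sentence.prefix": {
                                "query": text,
                                "boost": prefix_boost,  # assumed value, tune as needed
                            }
                        }
                    }
                ],
            }
        }
    }


request_body = prefix_boosted_query("business")
```

Because the prefix clause sits under `should`, sentences that contain the phrase later on are still returned; they just do not receive the extra score.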
I think you can easily boost results that are truly at the beginning by using a second field with a keyword tokenizer and edge n-grams. If you really want to take the token position into account, I think you would pay too high a price for autocomplete: you'd also need to use payloads etc., and would likely have to build your own query at the Lucene level.
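To illustrate why a keyword tokenizer with edge n-grams only matches at the very beginning, here is a small sketch. The analysis fragment uses hypothetical names (`prefix_index`, `prefix_search`, `prefix_edge_ngram`) and an assumed `max_gram`; the helper function is a plain-Python imitation of what such an analyzer would emit, not a call into Elasticsearch.

```python
def prefix_analysis_settings(max_gram=50):
    """Analysis settings for a prefix-only field (all names are assumptions)."""
    return {
        "filter": {
            "prefix_edge_ngram": {"type": "edge_ngram", "min_gram": 1, "max_gram": max_gram}
        },
        "analyzer": {
            # Index time: the whole sentence is one token, then cut into prefixes.
            "prefix_index": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase", "prefix_edge_ngram"],
            },
            # Search time: the whole query is one lowercased token, no n-grams.
            "prefix_search": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            },
        },
    }


def keyword_edge_ngrams(text, min_gram=1, max_gram=50):
    """Imitate keyword tokenizer + lowercase + edge n-grams: prefixes of the whole text."""
    whole = text.lower()
    longest = min(len(whole), max_gram)
    return [whole[:n] for n in range(min_gram, longest + 1)]


# "business" is an indexed token only for the sentence that starts with it.
starts_with = "business" in keyword_edge_ngrams("business is all over the city as you know")
ends_with = "business" in keyword_edge_ngrams("I love my business")
```

Here `starts_with` is true and `ends_with` is false, which is the behaviour the second field relies on. Note that `max_gram` has to be at least as long as the longest query you want to match by prefix.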
I believe span queries could help here too, specifically span_first, span_near, and span_multi. span_first will match at or near the beginning of the field. span_near allows you to control order when there are multiple terms. span_multi can be used to wrap a prefix query, and can also wrap a fuzzy query if you want to handle misspellings. Span queries can also support stemming, but you need to analyze the query yourself before passing it to them. It will be slower (I don't know by how much), but it allows more flexibility.
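A rough sketch of how these three queries can be nested, written as a Python dict builder. It assumes the terms have already been analyzed by the caller (lowercased, etc.) and that the last term is the partially typed word; the function name and the choice of field are illustrative.

```python
def span_phrase_prefix(field, analyzed_terms, end):
    """span_first around an ordered span_near whose last clause is a prefix.

    `analyzed_terms` must already be analyzed; span queries do not analyze input.
    """
    if not analyzed_terms:
        raise ValueError("at least one term is required")
    *complete, partial = analyzed_terms
    # Fully typed words are matched exactly.
    clauses = [{"span_term": {field: term}} for term in complete]
    # The word still being typed is matched by prefix via span_multi.
    clauses.append({"span_multi": {"match": {"prefix": {field: {"value": partial}}}}})
    if len(clauses) == 1:
        inner = clauses[0]
    else:
        # slop 0 + in_order keeps the terms adjacent and in the typed order.
        inner = {"span_near": {"clauses": clauses, "slop": 0, "in_order": True}}
    # `end` is the maximum end position allowed for the matched span.
    return {"query": {"span_first": {"match": inner, "end": end}}}


request_body = span_phrase_prefix("Sentence", ["love", "my", "bus"], end=4)
```

A prefix wrapped in span_multi is rewritten into one clause per matching term, so very short prefixes on a large vocabulary may become expensive or hit clause limits.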
Multiple span_first queries with increasing end values can be used to sort the results by start position.
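One way to read that suggestion, sketched as a Python dict builder: put one span_first clause per end value into a bool query, so that a term appearing earlier satisfies more clauses and accumulates more score. The function names and the use of `minimum_should_match` are assumptions for illustration, and the exact scores will still depend on the similarity in use.

```python
def span_first_ladder(field, term, max_end):
    """One span_first clause per end value from 1 to max_end, OR-ed together."""
    should = [
        {"span_first": {"match": {"span_term": {field: term}}, "end": end}}
        for end in range(1, max_end + 1)
    ]
    # At least one clause must match; terms past max_end are not returned at all.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


def rungs_matched(term_position, max_end):
    """Number of ladder clauses satisfied by a term at a 0-based token position.

    A single term at position p has end position p + 1, so it satisfies
    every clause whose `end` is at least p + 1.
    """
    return max(0, max_end - term_position)


request_body = span_first_ladder("Sentence", "business", max_end=6)
first_word = rungs_matched(0, 6)   # "business is all over the city as you know"
fourth_word = rungs_matched(3, 6)  # "I love my business"
```

With `max_end=6`, the sentence starting with "business" satisfies all six clauses while "I love my business" satisfies three, which is what pushes the earlier match up.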
Hey.
That's quite an interesting solution, though not an easy one to maintain. Our auto-complete index is pretty small: about 200 megabytes. And the number of words in each sentence is pretty limited: about 5-6 at most.
I think that the ultimate solution would be to write an appropriate score function. Are the parameters such a script would need accessible to it? Here is the list of parameters.
I know you said that span queries would be difficult to maintain, but what if span_first was used as an additional relevance signal? Combine your existing query with span_first queries for the first search term only to automatically boost results where the phrase starts at the beginning of the field. This would also solve the problem where single word searches have the same score.
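A sketch of that combination as a Python dict builder. The existing match_phrase query stays mandatory and a single span_first clause on the first search term is added as an optional signal; the function name, the default `end` of 1 and the boost value are assumptions, and the first term must be passed in already analyzed.

```python
def phrase_with_start_boost(field, text, first_term, end=1, boost=2.0):
    """Keep the phrase query as-is and add span_first as an extra relevance signal."""
    return {
        "query": {
            "bool": {
                # Unchanged matching behaviour: the phrase must be present.
                "must": [
                    {"match_phrase": {field: {"query": text}}}
                ],
                # Optional signal: the first search term occurs within the
                # first `end` positions of the field.
                "should": [
                    {
                        "span_first": {
                            "match": {"span_term": {field: first_term}},
                            "end": end,
                            "boost": boost,  # assumed value, tune as needed
                        }
                    }
                ],
            }
        }
    }


request_body = phrase_with_start_boost("Sentence", "business", "business")
```

Since the span_first clause only adds score, the set of returned documents is the same as for the original query; only the ordering changes.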