Problem with shingles as an autocomplete solution


#1

Hi All,

Still fairly new to Elasticsearch, but very impressed so far. Right now
I'm working on a place finder service that will access a repository of
place names. I'm attempting to build in some autocomplete functionality,
and while I've made significant progress, it's not perfect. My current
mapping on the given field for both index and search is based on the
following analyzer:

"analyzer_shingle" : { "tokenizer" : "standard", "filter" : [ "standard",
"lowercase", "filter_shingle"] }

where filter_shingle is defined as follows:

"filter_shingle" : { "type" : "shingle", "max_shingle_size" : 5,
"min_shingle_size" : 2, "output_unigrams" : true }

I use this analyzer with a matchPhrasePrefixQuery, include a fuzziness of
0.8 and a maxExpansions of 30.

I also have a keyword analyzer which is queried with a
matchPhrasePrefixQuery as well, and is boosted so that names that start
with the entered value rank significantly higher.

For the most part, this works great! I mean it really nails the search
every time and it's blazing fast.

So here's my issue: while this setup is working well, it fails if the
query contains any additional words after the phrase that aren't found in
the actual data. For instance, if I search for Goat, I get results like the
following:

Goat
Goat Corral Flat
Goat Island
Goat Island Preserve Trail
Big Goat Road

Then if I search for "Goat Isla", I find a whole bunch of Goat Islands.

However, if I continue typing say, "Goat Island United States", the search
doesn't return any results. Now that bums me out for two reasons. On one
hand, this doesn't seem to make sense with the shingle filter, but maybe
I'm wrong. In my understanding, the shingle filter will make something
like the following tokens:

Goat
Goat Island
Goat Island United States
Island
Island United
United States

and so on and so forth...
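
To make the token list above concrete, here is a rough Python emulation of what a shingle filter with min_shingle_size 2, max_shingle_size 5 and output_unigrams enabled would emit. It is only a sketch: it splits on whitespace instead of using the standard tokenizer, and the function name `shingles` is made up for illustration.

```python
def shingles(text, min_size=2, max_size=5, output_unigrams=True):
    """Approximate a shingle token filter on lowercased, whitespace-split text."""
    words = text.lower().split()
    tokens = []
    for start in range(len(words)):
        if output_unigrams:
            tokens.append(words[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(words):
                break  # not enough words left for a shingle of this size
            tokens.append(" ".join(words[start:start + size]))
    return tokens


for token in shingles("Goat Island United States"):
    print(token)
```

For the four-word query this yields ten tokens, including "goat island united" and "island united states", which the hand-written list skips.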

Since all these tokens are passed into the search, and they are searching
on shingle tokenized data, then there should definitely be matches,
correct? "Goat Island" should still match some Goat Islands, and Island
should match a whole bunch of other things. Shouldn't I be finding data
here? Any thoughts on what I might be doing wrong? I would like to use
the United States part of the search in an additional query on another
field.

Thanks in advance for any help or direction!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/12003ba2-6c52-4ec5-83f6-45926a1a6551%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Elliott, you're on the right path. However, the match_phrase_prefix query
looks at the positions of your terms and matches only when all the
analyzed terms are found in the exact sequence given. That explains why
you're not matching as you expect. If you change your query to a match
query, you'll see immediately that it matches as you expected. However,
the match query does not do prefix matching, so you'll need to take care of
that some other way (for example, with an edge ngram filter).
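
As a rough illustration of the difference, the two request bodies could be built like this in Python. The field name "name" and the helper names are assumptions, not taken from the thread; only the query types and max_expansions come from the posts above.

```python
import json


def phrase_prefix_body(field, text, max_expansions=30):
    # Every analyzed term must appear, in order; the last term is a prefix.
    return {
        "query": {
            "match_phrase_prefix": {
                field: {"query": text, "max_expansions": max_expansions}
            }
        }
    }


def match_body(field, text):
    # Terms are OR-ed by default, so extra words such as "United States"
    # do not prevent a hit on "goat island".
    return {"query": {"match": {field: {"query": text}}}}


# "name" is a hypothetical field name.
print(json.dumps(phrase_prefix_body("name", "Goat Island United States"), indent=2))
print(json.dumps(match_body("name", "Goat Island United States"), indent=2))
```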



#3

Binh,

You were right! That worked perfectly. I added a front edge ngram filter
along with a shingles filter and switched to the match query, and the
results have been great.

Is this an ideal setup for autocomplete? Do you have any other
suggestions? I combine my shingle+ngrams analyzer based query in a boolean
query with another query that uses a keyword analyzer at index time and a
shingle analyzer at search time, to increase the weighting of exact
matches. I would also like my search to weight earlier terms higher than
later terms (i.e. a search for Bear Creek would prefer results with Bear
over results with Creek). Any suggestions on this?

The Elasticsearch API is pretty amazing when you get things working
correctly, but it sometimes feels like a lot of trial and error.
Getting close though!
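
A combined query of the kind described above might look roughly like the following sketch. The sub-field names "name.autocomplete" and "name.exact" and the boost value are hypothetical; the thread does not give the actual mapping.

```python
import json


def autocomplete_body(text, exact_boost=5.0):
    """Build a bool query: a loose autocomplete clause plus a boosted exact clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Hypothetical field analyzed with shingles + edge n-grams.
                    {"match": {"name.autocomplete": {"query": text}}},
                    # Hypothetical keyword-analyzed field; boosted so that
                    # exact matches score higher.
                    {"match": {"name.exact": {"query": text, "boost": exact_boost}}},
                ]
            }
        }
    }


print(json.dumps(autocomplete_body("Bear Creek"), indent=2))
```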

  • Elliott

On Wed, Jan 29, 2014 at 6:20 PM, Binh Ly binh@hibalo.com wrote:

Elliot, you're on the right path. However, the match_phrase_prefix query
will then look at the positions of your terms and find matches only when
all the analyzed terms are found as well as taking into account the exact
sequence of the terms. That explains why you're not matching like you
expect. If you change your query to a match query, you'll see immediately
that it will match like you expected. However, the match query does not do
prefix matches so you'll need to take care of that some other way (like for
example perhaps using an edge ngram filter).



(chimpsarehungry) #4

Any response on if this is a good way to do autocomplete?



#5

Hi Shane,

While this initially seemed to be a good solution for my case, I have since
moved away from it. A shingles filter alone significantly increases the
size of an index. An n-gram filter only compounds this issue. Each
shingle is n-grammed, and to get decent results you have to allow for large
max n-grams. In my case, this led to an index that was too large,
resulting in slower queries.
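
To get a feel for the growth, here is a small Python sketch that counts the terms a single place name expands into when shingles are followed by edge n-grams. The gram sizes (1 to 20) are assumptions chosen for illustration, not the values I actually used, and the tokenization is a plain whitespace split.

```python
def shingle_tokens(text, min_size=2, max_size=5):
    """Unigrams plus word shingles of min_size..max_size words."""
    words = text.lower().split()
    tokens = list(words)
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Front-anchored prefixes of a token, up to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def indexed_terms(text):
    """Distinct terms after shingling and then edge n-gramming each token."""
    terms = set()
    for token in shingle_tokens(text):
        terms.update(edge_ngrams(token))
    return terms


name = "Goat Island Preserve Trail"
print(len(name.split()), "words")
print(len(shingle_tokens(name)), "tokens after shingling")
print(len(indexed_terms(name)), "distinct terms after edge n-grams")
```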

I am still using the shingle filter for basic search. It works great when
a user is going to be searching for a portion of a string and you want
phrase-like matching capabilities.

On the other hand, I've found the Completion Suggester to be incredibly
fast and efficient for autocomplete. Have a look here:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
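
As a minimal sketch of that approach, assuming a recent Elasticsearch version (older releases used a different request shape), the mapping, a document and a suggest request could be built like this. The index layout, the field name "suggest", the suggester name "place-suggest" and the weight are all made up for illustration.

```python
import json

# Mapping with a dedicated completion field next to the regular name field.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "suggest": {"type": "completion"},
        }
    }
}

# A document lists the inputs that should complete to it.
document = {
    "name": "Goat Island",
    "suggest": {"input": ["Goat Island"], "weight": 10},
}


def suggest_body(prefix, size=5):
    """Build a completion suggest request for the text typed so far."""
    return {
        "suggest": {
            "place-suggest": {
                "prefix": prefix,
                "completion": {"field": "suggest", "size": size},
            }
        }
    }


print(json.dumps(suggest_body("goat is"), indent=2))
```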

Good luck!

On Wednesday, April 23, 2014 2:54:13 PM UTC-4, Shane Neeley wrote:

Any response on if this is a good way to do autocomplete?


