Auto suggest with elasticsearch


(Alexander Reelsen) #1

Hi there,

I am currently trying to find the most elegant way to implement an
auto-suggest feature for my elasticsearch instance.

Currently there is only an index with products, which includes product
name, description, some image URLs, some stock data etc...
Now I want to implement auto-suggest only for the product name.

The first question is: does it make sense to reuse this product index,
or should I use a dedicated auto-completion index for this? To keep the
returned data as small as possible, I do not want to return all the
product data with each letter typed.

I am imagining an autocompletion solution like this and would love to
get some input whether this is useful at all and could be implemented
with elasticsearch:

  1. Create a handler (not sure if it's an index) on /suggest/
  2. On any request like /suggest/foo or /suggest/foobar, search for an
    entry in the suggest index(?) and return something like
    {
    "key": "foo",
    "suggestions": [ "foob", "fooba", "foobar", "foobored", "foo baz" ],
    "something": "else"
    }
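
To make the proposed response shape concrete, here is a minimal Python sketch of what such a /suggest/ handler might compute. It is only an illustration under assumptions: the function name `suggest`, the `limit` parameter, and the in-memory list of product names are all hypothetical, and it returns full product names rather than partial words. A real setup would query elasticsearch instead of scanning a list.

```python
def suggest(prefix, product_names, limit=5):
    """Build a /suggest-style response for the typed prefix.

    Hypothetical helper: matches product names that start with the
    prefix (case-insensitive) and returns at most `limit` of them.
    """
    needle = prefix.lower()
    matches = sorted(
        {name for name in product_names if name.lower().startswith(needle)},
        key=str.lower,
    )
    return {"key": prefix, "suggestions": matches[:limit]}


if __name__ == "__main__":
    names = ["foobar", "foo baz", "Foobored", "bar"]
    print(suggest("foo", names))
```

Only the key and the suggestion strings are returned, which keeps the payload small for every keystroke.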

So, would something like this be useful, and can it be automatically
created from the product feed (or, more generically, any feed) to
ensure it includes all my product names? It would not need to be
realtime; updating once a day would be fine. Adding all possible typing
combinations manually sounds like a lot of overhead if analyzers could
do this work instead...

Thanks for any input!

--Alexander


(Stephane Bastian) #2

Hello,

On Mon, 2011-07-04 at 00:15 -0700, Alexander Reelsen wrote:

I am imagining an autocompletion solution like this and would love to
get some input whether this is useful at all and could be implemented
with elasticsearch:

  1. Create a handler (not sure if its an index) on /suggest/
  2. On any request like /suggest/foo or /suggest/foobar search for an
    entry in the suggest index(?) and return something like
    {
    key: "foo",
    suggestions: [ "foob", "fooba", "foobar", "foobored", "foo baz" ],
    something: "else"
    }

ES is definitely a good candidate for providing auto-suggest
functionality.
IMHO the first thing to focus on is deciding which type of
auto-suggest you are looking for.
In your example ("foob", "fooba", "foobar"...), the auto-suggest seems
to be prefix-based. But is this always the case?
Also, what do you expect if someone enters a word that does not match
the prefix? (For instance, if the user enters "francisco", do you expect
"san francisco" to be displayed?)
Another thing to consider is the expected behavior with sentences.
Off the top of my head, you should also decide whether you'll need to
sort or highlight the results.

Based on this, you can make the right decision as to the most
appropriate combination of Tokenizer/Filters/Queries to use.

If you want to implement a quick autocompletion, I suggest that you look
at the "text_phrase_prefix" query. It should give you a good head start:
http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html
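
As a rough illustration of what such a request body could look like, here is a small Python sketch that only builds the JSON and does not talk to a cluster. Several things are assumptions: the field name `name`, the helper name `build_prefix_query`, and the chosen `max_expansions` and `size` values. Later Elasticsearch releases renamed the `text` query family to `match`, so the sketch defaults to `match_phrase_prefix` and lets you pass `text_phrase_prefix` for the older spelling. The `_source` filter is how newer releases restrict the returned fields; older releases used a `fields` entry instead.

```python
import json


def build_prefix_query(field, typed, query_type="match_phrase_prefix",
                       max_expansions=10, size=5):
    """Build a search body that matches `typed` as a phrase prefix.

    Hypothetical helper; pass query_type="text_phrase_prefix" for the
    older query name mentioned in this thread.
    """
    return {
        "size": size,
        # Only return the suggested field, not the whole product document.
        "_source": [field],
        "query": {
            query_type: {
                field: {"query": typed, "max_expansions": max_expansions}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_prefix_query("name", "foo"), indent=2))
```

The body would be sent to the product index's search endpoint on each keystroke, which addresses the concern about returning all product data per letter typed.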

Hope this helps.

Since auto-suggest seems to be popular, how about enhancing the ES docs
with the most common auto-suggest use cases and solutions (with pros
and cons, snippets of code, the tokenizer/filter/query to use, and such)?
If other people are interested, I can certainly contribute to it.


(Weiwei Wang) #3

Auto-suggest can be implemented by adding a field that is tokenized by
StandardTokenizer (or another tokenizer) and filtered by EdgeNGramFilter
or NGramFilter, depending on your project requirements.

You can look into the Lucene contrib SpellChecker for some insight.

This article, http://today.java.net/pub/a/today/2005/08/09/didyoumean.html,
should also be helpful for understanding search suggestions.
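
To show what an edge n-gram filter contributes, here is a small Python sketch that mimics the idea outside of Lucene. It is an approximation under assumptions: whitespace splitting plus lowercasing stands in for StandardTokenizer, and the function name `edge_ngrams` and the `min_gram`/`max_gram` defaults are chosen for illustration only.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of every token in `text`.

    Simplified stand-in for tokenizer + EdgeNGramFilter: each token
    yields its prefixes of length min_gram up to max_gram.
    """
    grams = []
    for token in text.lower().split():
        longest = min(len(token), max_gram)
        for length in range(min_gram, longest + 1):
            grams.append(token[:length])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("San Francisco", min_gram=2, max_gram=4))
```

Because the grams are produced per token, typing "fran" can match "San Francisco" even though the product name does not start with it. A plain NGramFilter would additionally index substrings from the middle of each token.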


(Michael McCandless) #4

Also, as of Lucene 3.3, the spellchecker module
(lucene/contrib/spellchecker) now has three implementations for
auto-suggest, so in theory ES can just expose these?

Mike

http://blog.mikemccandless.com


(Shay Banon) #5

Hi,

Yeah, auto-suggest can be done in several ways. The first is simply executing a query against the relevant field and only getting that field back. That field can be analyzed (for example, using ngrams) if it makes sense for the auto-suggestion.

@mike: I saw the suggest module that was added. The problem with it is that it requires a full rebuild and can't be dynamically updated to reflect changes in the index. This requires a system where periodic rebuilds are done, and personally, I'm not a big fan of that :). Rebuilds can become expensive (file system cache invalidation) and unintuitive (since they lack real-time or near-real-time behavior). Though, the FST-based one is cool :). If an in-memory auto-suggest is required on a field, I was thinking of a non-blocking, trie-based data structure (derived from ConcurrentSkipList), but that requires work :)
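
For readers unfamiliar with the trie idea, here is a minimal Python sketch of an in-memory structure that accepts new terms at any time and answers prefix lookups without a rebuild. It is deliberately simplified and makes assumptions: the class name `SuggestTrie` and its methods are hypothetical, it is single-threaded, and it does not attempt the non-blocking behavior described above.

```python
class SuggestTrie:
    """Tiny prefix tree: terms can be added incrementally."""

    def __init__(self):
        # Each node is a dict of child characters; the "" key marks
        # the end of a term and stores its original spelling.
        self._root = {}

    def add(self, term):
        node = self._root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node[""] = term

    def complete(self, prefix, limit=5):
        node = self._root
        for ch in prefix.lower():
            node = node.get(ch)
            if node is None:
                return []
        results = []
        stack = [node]
        # Depth-first walk; children are pushed in reverse order so the
        # smallest character is visited first.
        while stack and len(results) < limit:
            current = stack.pop()
            if "" in current:
                results.append(current[""])
            for ch in sorted((k for k in current if k), reverse=True):
                stack.append(current[ch])
        return results


if __name__ == "__main__":
    trie = SuggestTrie()
    for name in ["foobar", "foo baz", "foobored", "bar"]:
        trie.add(name)
    print(trie.complete("foo"))
    trie.add("food")  # visible immediately, no rebuild step
    print(trie.complete("food"))
```

Calling `add` whenever a document is indexed keeps the suggestions in step with the index, which is the property the rebuild-based suggesters lack.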

-shay.banon


(Michael McCandless) #6

Yeah I agree having to fully rebuild the suggest data structures
periodically isn't great.

Mike

http://blog.mikemccandless.com


(Robert Muir) #7

But how much of a big deal is this in practice?

I suspect that most serious autosuggest users populate their
suggestions not directly from the terms, but from query logs
(probably with a lot of processing too!).
It may be the case that the query logs are also in a Lucene index, in
which case the ability for suggest to pull from an index's terms is
convenient, but I'm really skeptical that you can get a nice suggest
based on the individual terms of the index itself.
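
A minimal Python sketch of the query-log idea follows, assuming the log is simply a list of raw query strings. The function names `build_query_counts` and `suggest_from_log` are hypothetical, and the only "processing" shown is trimming and lowercasing; real pipelines would likely do much more.

```python
from collections import Counter


def build_query_counts(query_log):
    """Count normalized queries from a raw log (list of strings)."""
    return Counter(q.strip().lower() for q in query_log if q.strip())


def suggest_from_log(prefix, counts, limit=5):
    """Return the most frequent logged queries starting with `prefix`."""
    needle = prefix.lower()
    candidates = [(q, n) for q, n in counts.items() if q.startswith(needle)]
    # Most popular first; ties are broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in candidates[:limit]]


if __name__ == "__main__":
    log = ["foo bar", "foo bar", "foobar", "Foo Bar ", "baz"]
    print(suggest_from_log("foo", build_query_counts(log)))
```

Here the ranking comes from what users actually searched for rather than from the terms stored in the index.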

I do happen to think you can get by OK this way for spellchecking,
because it's an easier problem: adding context gets you the "final
inch" or last 5% or so, but doing it word-by-word is still OK.

All this being said, it might be possible to implement this in a manner
similar to Lucene 4.0's DirectSpellChecker, but performance will be
tricky.
In https://issues.apache.org/jira/browse/LUCENE-2507, I benchmarked
the initial patch and it was unacceptable... only after Mike rewrote
the terms dictionary one or two times did we start to see acceptable
performance.

For suggest, the user might literally be waiting on this thing for
every keystroke, and that scares me. Maybe if we added a 'direct
suggester' the perf would be OK for lots of people, but maybe it would
require specialized data structures (e.g. terms index/dict impls) to
get OK performance... I'm personally a little discouraged from
looking at this myself, because if ultimately you cannot get good
quality from the index itself, it could be a big waste of time.


(Bjfish) #8

Here is a StumbleUpon blog article about search suggestions with ES
that you might find useful if you have not seen it yet:
http://www.stumbleupon.com/su/6rCBMC/www.stumbleupon.com/devblog/searching-for-serendipity/


(Shay Banon) #9

Heya Robert,

It's hard to answer a "suggestion" question, since suggestion can be interpreted in so many different ways... :).

So, I agree: for cases where the data is pretty much static or semi-static, the suggestion module is perfect, but there are many cases where it's not... A suggestion module for user-entered queries, one that suggests popular queries, falls nicely into one of those.

But there are many other suggestion types, ranging from suggestions based on fields (like usernames, titles, or something similar) to more advanced ones (geo-based, for example). Those lean more towards the realtime scenario. Note that you can still build the suggestions from the fields (either from fields that are not analyzed, or directly from "source", or, best of all, "as you index"), so it does not necessarily work on analyzed terms.

-shay.banon

On Tuesday, July 5, 2011 at 3:29 PM, Robert Muir wrote:

But how much of a big deal is this in practice?

I suspect that most serious autosuggest users are populating their
suggest, not directly from the terms, but based on query logs
(probably with a lot of processing too!).
It may be the case query logs are also in a lucene index, in which
case the ability for suggest to pull from an index's terms is
convenient, but I'm really suspicious if you can get a nice suggest
based on the individual terms of the index itself.

I do happen to think you can get by ok this way for spellchecking,
because its an easier problem: adding context gets you the "final
inch" or last 5% or so, but doing it word-by-word is still ok.

all this being said, it might be possible to implement in a manner
similar to lucene 4.0's directspellchecker, but performance will be
tricky.
in https://issues.apache.org/jira/browse/LUCENE-2507, I benchmarked
the initial patch and it was unacceptable... only after Mike rewrote
the terms dictionary one or two times did we start to see acceptable
performance.

for suggest, the user might literally be waiting on this thing for
every keystroke: and that scares me. maybe if we added a 'direct
suggester' the perf would be ok for lots of people, but maybe it would
require specialized data structures (e.g. terms index/dict impls) to
get ok performance... but I'm personally a little discouraged from
looking at this myself, because if ultimately you cannot get good
quality from the index itself, it could be a big waste of time.

On Mon, Jul 4, 2011 at 6:37 PM, Michael McCandless
<mail@mikemccandless.com (mailto:mail@mikemccandless.com)> wrote:

Yeah I agree having to fully rebuild the suggest data structures
periodically isn't great.

Mike

http://blog.mikemccandless.com

On Mon, Jul 4, 2011 at 3:33 PM, Shay Banon <shay.banon@elasticsearch.com (mailto:shay.banon@elasticsearch.com)> wrote:

Hi,
Yeah, auto suggest can be done in several ways. The first is to
simply execute a query against the relevant field and get only that field
back. That field can be analyzed (for example, using ngrams) if it makes
sense for the auto suggestion.
@mike: I saw the suggest module added; the problem with it is that
it requires a full rebuild and can't be dynamically updated to reflect
changes in the index. This requires a system where periodic rebuilds are
done, and personally, I'm not a big fan of that :). It can become expensive to
rebuild (file system cache invalidation) and non-intuitive (since it lacks real
or near real timeness). Though, the FST based one is cool :). If an
in-memory auto suggest is required on a field, I was thinking of a non-blocking
trie-based data structure (derived from ConcurrentSkipList), but that
requires work :)
-shay.banon

On Monday, July 4, 2011 at 3:49 PM, Michael McCandless wrote:

Also, as of Lucene 3.3, the spellchecker module
(lucene/contrib/spellchecker) now has 3 implementations for
auto-suggest, so in theory ES can just expose these?

Mike

http://blog.mikemccandless.com

On Mon, Jul 4, 2011 at 4:37 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:

Auto suggest can be implemented by adding a field that is tokenized by
StandardTokenizer (or another tokenizer) and filtered by EdgeNGramFilter or
NGramFilter, depending on your project requirements.

You can look into the Lucene contrib Spellchecker for some further
insight.
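
As a rough illustration of what such a field ends up containing, here is a minimal Python sketch. It is not Lucene code: the whitespace tokenizer and the function names are stand-ins invented for this example. Each word is split off and then expanded into its leading prefixes, which is why a typed prefix can later be matched with a plain term lookup.

```python
def tokenize(text):
    """Very rough stand-in for a tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=1, max_gram=10):
    """Tokenize, then expand each token into its edge n-grams."""
    grams = []
    for token in tokenize(text):
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


if __name__ == "__main__":
    # Hypothetical product/city name, used only for illustration.
    print(analyze("San Francisco", min_gram=2, max_gram=5))
```

Because the grams are produced per word, a prefix of any word in the name (not just the first one) becomes an indexed term.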

and this article
http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
should be helpful for understanding search suggestion.

On Jul 4, 4:01 pm, stephane <stephane.bastian....@gmail.com> wrote:

Hello,

On Mon, 2011-07-04 at 00:15 -0700, Alexander Reelsen wrote:

Hi there,

I am currently trying to find the most elegant way to implement auto
suggest feature for my elasticsearch instance.

Currently there is only an index with products, which includes product
name, description, some image urls, some stock data etc...
Now I want to implement auto suggest only for the product name.

The first question is: Is it useful to use this product index or to
use a specific auto completion index for this? To keep the returned
data as small as possible I do not want to return all the product data
with each letter typed.

I am imagining an autocompletion solution like this and would love to
get some input whether this is useful at all and could be implemented
with elasticsearch:

  1. Create a handler (not sure if its an index) on /suggest/
  2. On any request like /suggest/foo or /suggest/foobar search for an
    entry in the suggest index(?) and return something like
    {
    key: "foo",
    suggestions: [ "foob", "fooba", "foobar", "foobored", "foo baz" ],
    something: "else"
    }

ES is definitely a good candidate to provide auto-suggest type
functionality.
IMHO the first thing you've got to focus on is deciding the type of
auto-suggest you are looking for.
In your example ("foob", "fooba", "foobar"...), the auto-suggest seems
to be prefix-based. But is this always the case?
Also, what do you expect if someone enters a word which does not match
the prefix? (For instance, if the user enters "francisco", do you expect
"san francisco" to be displayed?)
Another thing to consider is the expected behavior with sentences.
Off the top of my head, you should also decide whether or not you'll need to
sort or highlight the results.

Based on this, you can make the right decision as to the most
appropriate combination of Tokenizer/Filters/Queries to use.

If you want to implement a quick autocompletion I suggest that you look
at the "text_phrase_prefix" query. It should give you a good
head start:
http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html
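
For orientation, a request body using that query might be built roughly like this. This is a sketch under assumptions: the field name `name`, the result size, and the use of a `fields` list to limit the returned data are placeholders chosen for this example, and `build_suggest_request` is a hypothetical helper, not part of any client library.

```python
import json


def build_suggest_request(typed, field="name", size=5):
    """Build a search body that matches `typed` as a phrase prefix on `field`.

    `field` and `size` are placeholders; adjust them to the real mapping.
    """
    return {
        "query": {"text_phrase_prefix": {field: typed}},
        # Assumption: ask only for the suggested field, not the whole product.
        "fields": [field],
        "size": size,
    }


if __name__ == "__main__":
    # The body would be sent to the product index's _search endpoint.
    print(json.dumps(build_suggest_request("foo"), indent=2))
```

This keeps the existing product index and avoids returning the full product document on every keystroke, which was the original concern.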

Hope this helps.

Since auto-suggest seems to be popular, how about if we enhance ES doc
with the most common auto-suggest use cases and solutions? (with pros
and cons, snippets of code, tokenizer/filter/query to use and such).
If other people are interested, I can certainly contribute to it

So, would something like this be useful and can it be automatically
created out of the product feed (or any feed to be more generic) to
ensure it includes all my product names (would not need to be
realtime, but updating once a day would be ok as well). Always adding
all possible typing combinations manually sounds like quite some
overhead, if there could be analyzers doing this work...

Thanks for any input!

--Alexander


(Stephane Bastian) #10

Hi All,

As Shay pointed out, auto-suggest type functionality must be handled
differently based on the expected behavior.

Currently ElasticSearch can handle most 'auto-suggest' use cases and I
believe most questions users have could be answered by some sort of FAQ
or a section in the doc.
Should we open a feature request to provide better doc around
auto-suggest and start documenting basic use cases for new ES users?

Stephane Bastian



#11

On Tue, Jul 5, 2011 at 9:06 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Heya Robert,
It's hard to answer a "suggestion" question, since suggestion can be
interpreted in so many different ways... :).
So, I agree, for cases where the data is pretty much static or semi
static, the suggestion module is perfect, but there are many cases
where it's not... A suggestion module for user-entered queries that suggests
popular queries falls nicely into one of those.
But there are many other suggestion types, ranging from suggestions
based on fields (like usernames, titles, or something similar) to more
advanced ones (geo based, for example). Those lean more toward the realtime
scenario. Note, you can still build the suggestions from the fields (either
fields that are not analyzed, or directly from "source", or, best, "as
you index"), so it does not necessarily work on analyzed terms.
-shay.banon

Well, for these simplistic suggestion types on unanalyzed content,
there is no problem, you just need to select a Lookup implementation
that supports add() (e.g. TST, Jaspell, but not FST)
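
To make the add() point concrete, here is a small Python sketch of a prefix lookup that accepts new terms without a rebuild. It is not the Lucene Lookup API and not a ternary search tree: a sorted list with binary search is swapped in to keep the example short, and the class and method names are invented for illustration.

```python
import bisect


class SortedLookup:
    """Prefix lookup over a sorted term list that can be updated in place."""

    def __init__(self):
        self._terms = []      # kept sorted so prefixes can be found by bisection
        self._weights = {}    # term -> weight

    def add(self, term, weight=1.0):
        """Insert a new term or overwrite the weight of an existing one."""
        if term not in self._weights:
            bisect.insort(self._terms, term)
        self._weights[term] = weight

    def lookup(self, prefix, num=5):
        """Return up to `num` terms starting with `prefix`, heaviest first."""
        start = bisect.bisect_left(self._terms, prefix)
        matches = []
        for i in range(start, len(self._terms)):
            if not self._terms[i].startswith(prefix):
                break
            matches.append(self._terms[i])
        matches.sort(key=lambda t: (-self._weights[t], t))
        return matches[:num]


if __name__ == "__main__":
    lk = SortedLookup()
    for term, weight in [("foobar", 3), ("foo baz", 1), ("food", 2)]:
        lk.add(term, weight)
    lk.add("foot", 10)          # added later, no rebuild needed
    print(lk.lookup("foo", num=3))
```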


(Shay Banon) #12

Well, for these simplistic suggestion types on unanalyzed content,
there is no problem, you just need to select a Lookup implementation
that supports add() (e.g. TST, Jaspell, but not FST)

Not sure they are that simplistic :), since it needs to be concurrent (even a read-write lock around it will be expensive) and allow for "deletion" (as in reducing a counter).
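
The two requirements can be sketched in a few lines of Python. Note the assumptions: this uses a single mutex and a linear scan, which is exactly the kind of locking cost raised above, so it only shows the shape of the problem (concurrent updates plus counter-based deletion), not a non-blocking structure. `CountingSuggester` is a made-up name.

```python
import threading
from collections import Counter


class CountingSuggester:
    """Thread-safe term counts; a term disappears when its count hits zero."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def add(self, term):
        """Record one more occurrence of `term`."""
        with self._lock:
            self._counts[term] += 1

    def remove(self, term):
        """'Delete' one occurrence by decrementing the counter."""
        with self._lock:
            if self._counts[term] <= 1:
                self._counts.pop(term, None)
            else:
                self._counts[term] -= 1

    def suggest(self, prefix, num=5):
        """Return up to `num` live terms with `prefix`, most frequent first."""
        with self._lock:
            hits = [(t, c) for t, c in self._counts.items() if t.startswith(prefix)]
        hits.sort(key=lambda tc: (-tc[1], tc[0]))
        return [t for t, _ in hits[:num]]
```

A term indexed by two documents stays suggestible until both are removed, which is the "reduce counter" behavior described above.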


#13

On Wed, Jul 6, 2011 at 11:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Well, for these simplistic suggestion types on unanalyzed content,
there is no problem, you just need to select a Lookup implementation
that supports add() (e.g. TST, Jaspell, but not FST)

Not sure they are that simplistic :), since it needs to be concurrent (even
a read write lock around it will be expensive) and allow for "deletion" (as
in reduce counter).

I think that's really overkill? You could always have terms in Lucene
that have all deleted docs, which will affect spellcheck, too.

But in both cases, for a normal search engine suggest, the following holds true:

  • normally you filter only high-freq terms (HighFrequencyDictionary or
    thresholdFrequency in Lucene), so the chance of spell-correcting to a
    term whose docs are all deleted is minimized.
  • similarly for suggest, the chance is minimal: if you are building
    from, say, terms of the past N days of query logs, I don't think you gain
    much by going to so much effort to expunge these all-deleted terms for
    things people stopped querying on, i.e. a periodic rebuild is
    sufficient for your suggest to represent the past N-days trend.
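
The two points above can be sketched together in Python: build the suggestion list from a window of query logs, drop anything below a frequency threshold, and simply rerun the build periodically. The function names, the `min_freq` parameter and the sample log are assumptions made for this example; they are not Lucene's HighFrequencyDictionary or thresholdFrequency.

```python
import bisect
from collections import Counter


def build_suggestions(queries, min_freq=2):
    """Count logged queries and keep only those seen at least `min_freq` times.

    Returns (sorted_terms, counts); call again to rebuild from a newer log.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    kept = {q: c for q, c in counts.items() if c >= min_freq}
    return sorted(kept), kept


def suggest(sorted_terms, counts, prefix, num=5):
    """Return up to `num` kept queries starting with `prefix`, most frequent first."""
    start = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for i in range(start, len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break
        hits.append(sorted_terms[i])
    hits.sort(key=lambda q: (-counts[q], q))
    return hits[:num]


if __name__ == "__main__":
    # Hypothetical query log for the past N days.
    log = ["foobar", "Foobar", "foo baz", "foobar", "foo baz", "fooo", "bar"]
    terms, counts = build_suggestions(log, min_freq=2)
    print(suggest(terms, counts, "foo"))
```

Rare or stale queries fall out of the list on the next rebuild without any explicit delete step.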

But in general, this is the use case that the suggest/spellcheck
framework is geared towards (along with supplying floating point
quality weights, etc). If instead you want to do a suggest that acts
more like a primitive prefix query on a transactional database and
less like a search engine, I think using edge ngrams is the way to go,
as lucene will take care of that stuff for you.
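
A toy Python model of that last option shows why updates and deletes come for free with edge ngrams: every word prefix is an ordinary indexed term, so removing a document removes its prefixes too. This is only a sketch of the idea with an invented `EdgeNGramIndex` class, not how Lucene stores postings, and it handles only single-word input no longer than `max_gram`.

```python
from collections import defaultdict


class EdgeNGramIndex:
    """Maps every word prefix to the ids of the documents containing it."""

    def __init__(self, max_gram=10):
        self.max_gram = max_gram
        self._postings = defaultdict(set)   # prefix -> set of doc ids
        self._docs = {}                     # doc id -> original name

    def _grams(self, name):
        for word in name.lower().split():
            for n in range(1, min(len(word), self.max_gram) + 1):
                yield word[:n]

    def index(self, doc_id, name):
        """Add a document, replacing any previous version with the same id."""
        self.delete(doc_id)
        self._docs[doc_id] = name
        for gram in self._grams(name):
            self._postings[gram].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and all of its prefix postings."""
        name = self._docs.pop(doc_id, None)
        if name is None:
            return
        for gram in set(self._grams(name)):
            ids = self._postings[gram]
            ids.discard(doc_id)
            if not ids:
                del self._postings[gram]

    def suggest(self, typed):
        """A typed prefix is just an exact lookup of one indexed gram."""
        ids = self._postings.get(typed.lower(), set())
        return sorted(self._docs[i] for i in ids)
```

For example, after indexing "San Francisco" and "Frankfurt", `suggest("fran")` finds both, and deleting or re-indexing either document is reflected immediately.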


(system) #14