nGram performance

Hi,

I wanted partial matching on a field, so I used the nGram tokenizer in my
index analyzer and just the standard tokenizer in my search analyzer, which
worked perfectly.
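For readers who want to see what such a setup could look like, here is a minimal sketch of an index definition built as a Python dict. The names `gram_tokenizer`, `partial_index` and `headline` are hypothetical, and the setting names follow recent Elasticsearch releases; as far as I recall, 2013-era versions spelled the tokenizer type `nGram` and used `index_analyzer` in the mapping, so check this against the version you run.

```python
import json

# Hypothetical names: "gram_tokenizer", "partial_index" and the "headline" field.
# Setting names are assumed from recent Elasticsearch releases, not from this thread.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "gram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 4,
                    # Only letters and digits form grams, so grams stay inside words.
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "partial_index": {
                    "type": "custom",
                    "tokenizer": "gram_tokenizer",
                    # Lowercase at index time so grams match lowercased query terms.
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "headline": {
                "type": "text",
                "analyzer": "partial_index",    # n-grams at index time
                "search_analyzer": "standard",  # whole words at search time
            }
        }
    },
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

One consequence of this design: since the query is not split into grams, a search term only matches if it is itself one of the indexed grams, so terms shorter than `min_gram` or longer than `max_gram` will not match.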

However, my question is how performance scales with a large amount of data,
because I assume this will result in a HUGE number of tokens in the index.
Does anybody know whether this is actually an issue in Elasticsearch, or
would it be best to avoid it on fields with a lot of text (e.g. an article
body) and use it only on smaller ones (e.g. an article headline)?
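To put a rough number on the token growth, here is a small back-of-the-envelope sketch. It assumes words are split on whitespace and that grams never cross word boundaries; the helper names and the sample text are made up for illustration, and this is not how Elasticsearch computes anything internally.

```python
def ngram_count(token_length, min_gram, max_gram):
    """Number of n-grams produced for a single token of the given length."""
    total = 0
    for n in range(min_gram, min(max_gram, token_length) + 1):
        # A token of length L contains L - n + 1 substrings of length n.
        total += token_length - n + 1
    return total


def field_gram_count(text, min_gram, max_gram):
    """Total n-grams for a field, splitting on whitespace (an assumption)."""
    return sum(ngram_count(len(word), min_gram, max_gram) for word in text.split())


# Made-up sample data: a short headline and a body 200 times as long.
headline = "Elasticsearch nGram performance"
body = " ".join([headline] * 200)

for name, text in [("headline", headline), ("body", body)]:
    words = len(text.split())
    grams = field_gram_count(text, min_gram=3, max_gram=5)
    print(name, "words:", words, "grams:", grams)
```

Under these assumptions the number of grams grows linearly with the amount of text; the multiplier per word depends on word length and on the `min_gram`/`max_gram` range, which is why a long body field costs proportionally more than a headline.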

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

On Thu, Jul 25, 2013 at 6:35 PM, Michael Lockwood
<mlockwood.1989@gmail.com> wrote:

> Hi,
>
> I wanted partial matching on a field, so I used the nGram tokenizer in my
> index analyzer and just the standard tokenizer in my search analyzer, which
> worked perfectly.
>
> However, my question is how performance scales with a large amount of data,
> because I assume this will result in a HUGE number of tokens in the index.
> Does anybody know whether this is actually an issue in Elasticsearch, or
> would it be best to avoid it on fields with a lot of text (e.g. an article
> body) and use it only on smaller ones (e.g. an article headline)?

Indeed, n-grams are going to make your index larger, but if you want to
perform partial matching, they are the way to go. Lucene/Elasticsearch
performs well on prefix queries, and there are tricks to make it perform
well on suffix queries too (by applying a filter that reverses the order of
the characters), but partial matching at query time is very costly because
it requires checking every term in the terms dictionary. In that case,
n-grams will perform better even though they make the index larger.
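The difference between the three cases can be illustrated with a toy terms dictionary. This is only a sketch using a sorted Python list and binary search, not Lucene's actual data structures; the function names and sample terms are made up.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Prefix lookup: binary search to the first candidate, then read neighbours."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches


def suffix_matches(sorted_reversed_terms, suffix):
    """Suffix lookup via the reverse trick: a prefix lookup on reversed terms."""
    hits = prefix_matches(sorted_reversed_terms, suffix[::-1])
    return sorted(term[::-1] for term in hits)


def infix_matches(terms, fragment):
    """Partial match without n-grams: every term has to be inspected."""
    return [term for term in terms if fragment in term]


terms = sorted(["elastic", "search", "research", "lucene", "searching"])
reversed_terms = sorted(term[::-1] for term in terms)

print(prefix_matches(terms, "sea"))
print(suffix_matches(reversed_terms, "search"))
print(infix_matches(terms, "arch"))
```

The prefix and suffix lookups only touch the matching region of the sorted list, while the infix lookup walks the whole list. With n-grams indexed, the fragment itself becomes a term, so the lookup is a plain term match again.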

--
Adrien Grand


Just a hint for n-grams if you are after autocompletion: consider a
reasonable configuration. For example, you could check how your index size
grows when you restrict n-grams to lengths longer than 2 and shorter than 6
(that is, 3 to 5 characters). This might reduce some overhead in the
dictionary.
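A quick way to get a feel for this before reindexing is to count distinct grams on a sample of your own vocabulary. The sketch below is plain Python with a made-up word list and hypothetical helper names; real index size also depends on postings and other structures, so treat it as a rough indicator only.

```python
def ngrams(token, min_gram, max_gram):
    """All distinct substrings of token with length between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def dictionary_size(vocabulary, min_gram, max_gram):
    """Number of distinct grams the vocabulary would add to the terms dictionary."""
    unique = set()
    for word in vocabulary:
        unique |= ngrams(word, min_gram, max_gram)
    return len(unique)


# Made-up sample vocabulary; replace it with terms from your own data.
vocabulary = ["elasticsearch", "lucene", "tokenizer", "analyzer", "performance"]

for min_gram, max_gram in [(1, 10), (2, 6), (3, 5)]:
    size = dictionary_size(vocabulary, min_gram, max_gram)
    print("min_gram", min_gram, "max_gram", max_gram, "distinct grams", size)
```

A narrower range can only produce the same number of distinct grams or fewer, at the cost of not matching fragments outside that length range.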

Jörg

On Fri, Jul 26, 2013 at 11:25 AM, Adrien Grand
<adrien.grand@elasticsearch.com> wrote:

> Indeed, n-grams are going to make your index larger, but if you want to
> perform partial matching, they are the way to go. Lucene/Elasticsearch
> performs well on prefix queries, and there are tricks to make it perform
> well on suffix queries too (by applying a filter that reverses the order of
> the characters), but partial matching at query time is very costly because
> it requires checking every term in the terms dictionary. In that case,
> n-grams will perform better even though they make the index larger.
>
> --
> Adrien Grand
