I wanted partial matching functionality on a field, so I tried using the
nGram tokenizer in my index analyzer but just the standard tokenizer in my
search analyzer, which worked perfectly.
However, my question is how the performance scales when this is used with a
large amount of data, because I would assume it results in a huge number of
tokens in the index. Does anybody know whether this is actually an issue in
Elasticsearch, or whether it would be better not to use it on fields with a
lot of text (e.g. an article body) and stick to smaller ones (e.g. an
article headline)?
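For reference, here is a minimal sketch of the kind of setup described above: an ngram analyzer at index time and the standard analyzer at search time. The index, field, and analyzer names are illustrative, and the request syntax assumes a recent (7.x+) Elasticsearch version.

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "headline": {
        "type": "text",
        "analyzer": "trigram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

A match query on headline is then analyzed with the standard analyzer, while the indexed terms are 3- and 4-character grams, which is what gives the partial matching behaviour.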
Indeed, n-grams are going to make your index larger, but if you want to
perform partial matching, they are the way to go. Although
Lucene/Elasticsearch performs well on prefix queries, and there are tricks
to make it perform well on suffix queries as well (by applying a filter
that reverses the order of the characters), general partial matching is
very costly because it requires checking every term in the terms
dictionary. In that case, n-grams will perform better even if they make
the index larger.
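The reversal trick mentioned above can be sketched roughly like this, using Elasticsearch's built-in reverse token filter (index and field names are hypothetical): index the field a second time with the characters reversed, then run what is logically a suffix query as a prefix query on the reversed sub-field.

```json
PUT /articles_reversed
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reversed_terms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "headline": {
        "type": "text",
        "fields": {
          "reversed": {
            "type": "text",
            "analyzer": "reversed_terms"
          }
        }
      }
    }
  }
}
```

To find headlines ending in "search", you would reverse and lowercase the query string yourself (the prefix query is not analyzed) and run `{ "query": { "prefix": { "headline.reversed": "hcraes" } } }`.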
Just a hint for nGrams if you are after autocompletion: consider a
reasonable configuration. For example, you could check how your index size
grows for n-grams longer than 2 and shorter than 6 characters. This might
reduce some overhead in the terms dictionary.
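One quick way to gauge that overhead is the _analyze API against the field (here assuming the illustrative articles index from the sketch above): it shows exactly which grams a given input produces. A word of length L produces L - n + 1 grams of size n, so the 13-character word "elasticsearch" yields 11 three-character grams and 10 four-character grams, i.e. 21 terms where the standard analyzer would have stored one.

```json
GET /articles/_analyze
{
  "field": "headline",
  "text": "elasticsearch"
}
```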