Hi guys, I have a problem with my analyzer. The following is the setup:
analyzer:
  type : custom
  tokenizer : whitespace
  filter : [word_delimiter, asciifolding, standard, lowercase, synonym, edgeNGram]

word_delimiter filter:
  type : word_delimiter
  preserve_original : true

edgeNGram filter:
  type : edgeNGram
  min_gram : 2
  max_gram : 15
  side : front
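For reference, the fragments above would map to index settings along these lines (a sketch only — I am reusing the filter names exactly as listed, and the synonym filter definition is omitted since it was not shown):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["word_delimiter", "asciifolding", "standard",
                     "lowercase", "synonym", "edgeNGram"]
        }
      },
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": true
        },
        "edgeNGram": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 15,
          "side": "front"
        }
      }
    }
  }
}
```

(The analyzer name `my_analyzer` is an assumption; your actual name may differ.)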
The problem is that when I run a curl -XGET query for
'WordWithOverFifteenChars', the result is 'WordWithOverFifteenChars' broken
down into n-grams of up to 15 characters, but the whole word itself does not
get indexed. The same happens for words like 'something.company': the whole
word does not get preserved (despite the preserve_original setting on the
word_delimiter filter); it only gets broken down into n-grams as a whole up
to 'something.compa' (15 chars), and n-grams for 'company' and 'something'
are created as well.
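To show what I mean, here is a minimal Python sketch of what a front-side edge n-gram filter with min_gram=2 and max_gram=15 appears to do to a single token (just an illustration of the behavior I observe, not Elasticsearch's actual code):

```python
# Simulate a front-side edge n-gram filter: emit prefixes of the token
# with lengths between min_gram and max_gram. Tokens longer than
# max_gram never appear in full in the output.

def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return front edge n-grams of `token` between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# The longest emitted gram is capped at 15 chars...
print(edge_ngrams("WordWithOverFifteenChars")[-1])  # -> 'WordWithOverFif'
print(edge_ngrams("something.company")[-1])         # -> 'something.compa'

# ...and the full original token is not among the emitted grams.
print("WordWithOverFifteenChars" in edge_ngrams("WordWithOverFifteenChars"))  # -> False
```

This matches exactly what I see in the index: grams stop at 15 characters and the original word is gone.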
Could you give me some insight into what I am missing?
You received this message because you are subscribed to the Google Groups "elasticsearch" group.