I'm curious whether an asciifolding character filter exists. I know there
is an asciifolding token filter, and that the analysis chain works as
follows: input text > char_filter > tokenizer > token filter > output
tokens.
I also checked the icu-plugin: the icu_normalizer can be used both as a
character filter and a token filter, but the icu_folding filter is only
available as a token filter (one that actually incorporates the icu_normalizer).
I'm generating ngrams and shingles, so it seems more logical to apply
ascii/icu folding as a character filter, but I can't find one. Is there one?
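To illustrate why the stage might matter for ngrams, here is a minimal pure-Python sketch of the chain described above. It is not Elasticsearch code: `fold_chars`, `ngram_tokenizer` and `analyze` are hypothetical helpers, and the folding shown (NFD decomposition plus dropping combining marks) is only a rough stand-in for what asciifolding or icu_folding do.

```python
import unicodedata


def fold_chars(text):
    """Rough accent folding: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngram_tokenizer(text, n=2):
    """Split text into overlapping character ngrams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def analyze(text, char_filters=(), token_filters=(), n=2):
    """input text > char_filter > tokenizer > token filter > output tokens."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = ngram_tokenizer(text, n)
    for token_filter in token_filters:
        tokens = [token_filter(token) for token in tokens]
    return tokens


# "cafe" followed by U+0301 (combining acute accent), i.e. decomposed input.
decomposed_input = "cafe\u0301"

# Folding before tokenizing: the ngrams are cut from the already folded text.
folded_first = analyze(decomposed_input, char_filters=[fold_chars])

# Folding after tokenizing: the tokenizer has already split the base letter
# from its combining mark, which leaves a stray one-character token.
folded_last = analyze(decomposed_input, token_filters=[fold_chars])

print(folded_first)  # ['ca', 'af', 'fe']
print(folded_last)   # ['ca', 'af', 'fe', 'e']
```

With precomposed input both orders give the same bigrams in this toy model; the difference only shows up when folding changes the number of characters the tokenizer sees.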
On Mon, Jan 19, 2015 at 7:18 PM, Mathijs Biesmans <mathijs....@gmail.com>
wrote:
I'm curious whether there exists an asciifolding character filter, I
know there is an asciifolding token filter and that the analysis chain
works as follows: input text > char_filter > tokenizer > token filter >
output tokens.
I also checked the icu-plugin: the icu_normalizer can be used both as
a character filter and a token filter. But the icu_folding filter is
only available as a token filter (that actually incorporates the
icu_normalizer).
I'm generating ngrams and shingles, so it seems more logical to apply
ascii/icu folding as a character filter. But I can't find one?
I can't tell. The official Elasticsearch ICU plugin is lagging behind Lucene
5.0 (ICUCollationKeyAnalyzer / collation key field type, API deprecation
updates, etc.), so I hope it will soon pick up pace again and get ready for
new features.
I'm curious whether there exists an asciifolding character filter, I
know there is an asciifolding token filter and that the analysis chain
works as follows: input text > char_filter > tokenizer > token filter >
output tokens.
The text in the Elastic documentation at
current/asciifolding-token-filter.html mentions: "[...] With Western
languages, this can be done with the asciifolding character filter. [...]",
though the URL says asciifolding-token-filter. Is that an error in the docs?
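One possible workaround, offered as an untested sketch: the built-in `mapping` character filter replaces characters before tokenization, so a small hand-written table could approximate folding at the char_filter stage. The names `fold_accents` and `folded_ngrams`, the gram sizes, and the handful of mapping entries are assumptions for illustration only; this is nowhere near the full asciifolding table.

```python
import json

# Index settings with folding done in a "mapping" char_filter, so the ngram
# tokenizer only ever sees folded text. All custom names here are made up.
settings = {
    "analysis": {
        "char_filter": {
            "fold_accents": {
                "type": "mapping",
                # Illustrative subset; a real table would be much longer.
                "mappings": ["é => e", "è => e", "ü => u", "æ => ae"],
            }
        },
        "tokenizer": {
            "bigram_tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 2}
        },
        "analyzer": {
            "folded_ngrams": {
                "type": "custom",
                "char_filter": ["fold_accents"],
                "tokenizer": "bigram_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}

# Body that could be sent as the "settings" part of an index creation request.
settings_json = json.dumps({"settings": settings}, ensure_ascii=False, indent=2)
print(settings_json)
```

Because the mapping filter matches its keys literally, uppercase accented characters would need their own entries, since `lowercase` only runs after the tokenizer.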