Configurable ASCIIFolding and CharReplace filters done

rmuir · February 16, 2011, 3:40pm

On Wed, Feb 16, 2011 at 9:27 AM, Kristian Jörg krjg@devo.se wrote:

Hi all.

I've been digging around in ES lately to get it to do what I want. Part of
that is to normalize text so that diacritics and other accents are removed.
There are a number of ways to do this in ES. We have the ICU_folding
filter, for instance, but it folds ALL diacritics without regard to language.
The same goes for the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been approved as a
standard by ICU. It may still be useful though...

I want to retain the Swedish characters åäöÅÄÖ, but fold all other variants.
In Swedish, åäö are not variants of the letters aao; they are primary letters
that carry as much meaning as abc. For instance,
kalla and källa are two distinctly different words. The same goes for the
letters æ and ø in Norwegian and Danish.

Right, the purpose of this is language-independent folding. It's not
any Unicode standard, just based on a nice body of work (a withdrawn
standard) for what is just a heuristic.

If you want it to not fold certain things, you should use the expert
constructor of ICUNormalizer2Filter with a FilteredNormalizer2.

example:

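import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.UnicodeSet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;

/* the normalizer2s here are immutable and can be static/thread-safe */
Normalizer2 base = Normalizer2.getInstance(
    ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"),
    "utr30", Normalizer2.Mode.COMPOSE);
/* normalize everything EXCEPT the Swedish letters */
UnicodeSet filter = new UnicodeSet("[^åäöÅÄÖ]");
filter.freeze();
Normalizer2 filtered = new FilteredNormalizer2(base, filter);
/* "stream" is the upstream TokenStream, e.g. the tokenizer output */
TokenStream folded = new ICUNormalizer2Filter(stream, filtered);
...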
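
To see the "fold everything except a few letters" idea in isolation, here is a minimal sketch that uses only the JDK's java.text.Normalizer. It is not the ICU/Lucene code path above and it does not implement the UTR#30 foldings: it only strips combining marks from characters that decompose, so letters such as æ and ø pass through unchanged. The class name SelectiveFolding and the method fold are made up for this illustration, and the keep-list is written with Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

public class SelectiveFolding {
    // Hypothetical keep-list: a-ring, a-umlaut, o-umlaut in lower and upper case.
    private static final String KEEP = "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6";

    public static String fold(String input) {
        // Compose first so that "a" + combining diaeresis is seen as one letter.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder();
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (KEEP.contains(ch)) {
                // Protected letter: copy it through untouched.
                out.append(ch);
            } else {
                // Decompose and drop the combining marks (Unicode category M).
                String decomposed = Normalizer.normalize(ch, Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "källa café": the Swedish letter is kept, the e-acute is folded.
        System.out.println(fold("k\u00E4lla caf\u00E9"));
    }
}
```

The keep-list here plays the same role as the UnicodeSet filter in the example above, just inverted: the UnicodeSet names what may be normalized, while KEEP names what must be left alone.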
