On Wed, Feb 16, 2011 at 9:27 AM, Kristian Jörg krjg@devo.se wrote:
Hi all.
I've been digging around in ES lately, trying to get it to do what I want. Part of
that is normalization, so that diacritics and other accents are removed.
There are a number of ways to do this in ES. We have the ICU_folding
filter, for instance, but it folds ALL diacritics without regard to language.
The same goes for the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been approved as a
standard by ICU. It may still be useful though...

I want to retain the Swedish characters åäöÅÄÖ, but fold all other variants.
In Swedish, åäö are not variants of the letters aao; they are primary letters
that have as much meaning as abc. For instance,
kalla and källa are two distinctly different words. The same goes for the
letters æ and ø in Norwegian and Danish.

Right, the purpose of this is language-independent folding. It's not
any Unicode standard, just a heuristic based on a nice body of work (a
withdrawn standard).
If you want it to not fold certain things, you should use the expert
ctor with a FilteredNormalizer2.
example:
/* the normalizer2s here are immutable and can be static/thread-safe */
Normalizer2 base = Normalizer2.getInstance(
    ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"),
    "utr30", Normalizer2.Mode.COMPOSE);
/* only characters in this set get folded; åäöÅÄÖ are excluded and pass through unchanged */
UnicodeSet filter = new UnicodeSet("[^åäöÅÄÖ]");
filter.freeze();
Normalizer2 filtered = new FilteredNormalizer2(base, filter);
/* stream is the existing TokenStream, e.g. the tokenizer output */
stream = new ICUNormalizer2Filter(stream, filtered);
...
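
To see the intended effect without pulling in ICU or Lucene, here is a rough plain-JDK sketch of the same idea: fold everything except a protected set of letters. It is not the ICU filter and it does far less than the utr30 folding (it only strips combining marks, so letters such as æ and ø are left alone, and there is no case folding). The class name SelectiveFold and the method fold are made up for this illustration.

```java
import java.text.Normalizer;

public class SelectiveFold {
    // Letters to leave untouched: åäöÅÄÖ, written as escapes so the
    // source compiles regardless of file encoding. (Illustrative set.)
    private static final String KEEP = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6";

    /** Strips combining marks from every character except those in KEEP. */
    public static String fold(String input) {
        // Compose first so that "a" + U+0308 is seen as the single letter ä.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (KEEP.indexOf(cp) >= 0) {
                // Protected letter: pass through unchanged.
                out.append(ch);
            } else {
                // Decompose, then drop the combining marks (category M).
                out.append(Normalizer.normalize(ch, Normalizer.Form.NFD)
                        .replaceAll("\\p{M}+", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "källa café": the ä is protected, the é is folded to e.
        System.out.println(fold("k\u00e4lla caf\u00e9"));
    }
}
```

With this, kalla and källa stay distinct after folding, while café becomes cafe. The FilteredNormalizer2 approach above gives the same kind of split, but with the full utr30 folding applied to everything outside the protected set.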