Hi all. I've been digging around in ES lately, trying to get it to do what I want. Part of that is normalizing text so that diacritics and other accents are removed. There are a number of ways to do this in ES. There is the icu_folding filter, for instance, but it folds ALL diacritics regardless of language. The same goes for the ASCIIFoldingFilter. Btw, the UTR#30 spec that icu_folding is based on has NOT been approved as a standard by the Unicode Consortium. It may still be useful, though...

I want to retain the Swedish characters åäöÅÄÖ but fold all other variants. In Swedish, åäö are not variants of the letters a, a and o; they are primary letters that carry as much meaning as abc. For instance, kalla and källa are two distinctly different words. The same goes for the letters æ and ø in Norwegian and Danish.
So I came up with a solution: modify the standard Lucene ASCIIFolding filter so that it ignores a configurable set of characters. I also optimized the filter to scan for lower-case characters first. The Lucene implementation mixes lower case and capitals, but lower-case characters are more common in ordinary text, and most of the time one would run a lowercase filter first anyway.

With ignore_chars=åäö the filter will normalize, for instance:

    idé      -> ide
    Lukáš    -> Lukas
    Göteborg -> Göteborg

I also created another filter that replaces characters. I have several cases with library catalogues where no distinction should be made between, for instance, 'v' and 'w'. I.e. if you search for värmland you should get a hit for both Wärmland and Värmland. With the setting w=v the filter will normalize the following:

    värmland -> värmland
    wärmland -> värmland

The way to use both of these filters is:

    index:
      analysis:
        analyzer:
          default:
            type: custom
            tokenizer: standard
            filter: [lowercase, asciiFolding, replaceChars]
        filter:
          asciiFolding:
            type: se.devo.esfilter.ConfigurableASCIIFoldingTokenFilterFactory
            ignore_chars: åäö
          replaceChars:
            type: se.devo.esfilter.ReplaceCharTokenFilterFactory
            char_pairs: [w,v, j,i]

Is this something of general interest, so that it should be contributed? Or is it better that I keep it as a private plugin?
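For anyone who wants to try the idea without Lucene on the classpath, here is a rough sketch of the two normalizations in plain Java. It is not the plugin code. The class and method names (FoldingSketch, fold, replaceChars, analyze) are made up for illustration, and the folding uses java.text.Normalizer decomposition as a stand-in for Lucene's much larger ASCII folding table. That means letters without a Unicode decomposition (ø, æ, ß and so on) pass through unchanged here, whereas the real ASCIIFoldingFilter would map them. Non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative sketch only; class and method names are hypothetical, not the plugin's.
final class FoldingSketch {

    // a-ring, a-umlaut, o-umlaut, written as Unicode escapes so the file stays ASCII.
    static final String SWEDISH = "\u00e5\u00e4\u00f6";

    /** Removes combining marks from every character except those in ignoreChars. */
    static String fold(String input, String ignoreChars) {
        // Compose first so a base letter plus combining mark is seen as one character.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        String keep = Normalizer.normalize(ignoreChars, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); ) {
            int cp = composed.codePointAt(i);
            i += Character.charCount(cp);
            if (keep.indexOf(cp) >= 0) {
                out.appendCodePoint(cp); // protected letter: copy unchanged
                continue;
            }
            // Decompose this one character and drop its non-spacing marks.
            String parts = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            for (int j = 0; j < parts.length(); ) {
                int part = parts.codePointAt(j);
                j += Character.charCount(part);
                if (Character.getType(part) != Character.NON_SPACING_MARK) {
                    out.appendCodePoint(part);
                }
            }
        }
        // Recompose anything that was split without carrying marks (e.g. Hangul).
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    /** Replaces characters using a flat list of pairs: from1, to1, from2, to2, ... */
    static String replaceChars(String input, char... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("pairs must have an even length");
        }
        StringBuilder out = new StringBuilder(input);
        for (int i = 0; i < out.length(); i++) {
            for (int p = 0; p < pairs.length; p += 2) {
                if (out.charAt(i) == pairs[p]) {
                    out.setCharAt(i, pairs[p + 1]);
                    break; // apply at most one replacement per character
                }
            }
        }
        return out.toString();
    }

    /** Mirrors the filter order in the config: lowercase, folding, replacement. */
    static String analyze(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return replaceChars(fold(lower, SWEDISH), 'w', 'v', 'j', 'i');
    }

    public static void main(String[] args) {
        System.out.println(fold("id\u00e9", SWEDISH));        // ide
        System.out.println(fold("Luk\u00e1\u0161", SWEDISH)); // Lukas
        System.out.println(fold("G\u00f6teborg", SWEDISH));   // o-umlaut is kept
        // Both spellings of the province name end up as the same token.
        System.out.println(analyze("W\u00e4rmland").equals(analyze("V\u00e4rmland"))); // true
    }
}
```

The sketch works on whole strings to keep it short. A real token filter would do the same per-character work on the term buffer of each token instead.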
-- Kristian Jörg