Hi all.
I've been digging around in ES lately, trying to get it to do what I want.
Part of that is normalizing text so that diacritics and other accents
are removed.
There are a number of ways to do this in ES. We have the
ICU_folding filter, for instance, but it folds ALL diacritics without
regard to language. The same goes for the ASCIIFoldingFilter.
Btw, the UTR#30 spec that ICU_folding is based on has NOT been
approved as a standard by the Unicode Consortium. It may still be useful though...
I want to retain the Swedish characters åäöÅÄÖ, but fold all other
variants.
In Swedish, åäö are not variants of the letters aao; they are primary
letters that carry as much meaning as abc. For instance,
kalla and källa are two distinctly different words.
The same goes for the letters æ and ø in Norwegian and Danish.
So I came up with the solution of modifying the standard Lucene
ASCIIFoldingFilter so that it ignores a configurable set of characters.
I also optimized the filter to check for lower case characters first.
The Lucene implementation mixes lower case and capitals, but there are
more lower case characters in a typical text, and most often one would
run a lower case filter first anyway.
With ignore_chars=åäö, the filter will normalize, for instance:
idé -> ide
Lukáš -> Lukas
Göteborg -> Göteborg
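
To make the intended behaviour concrete, here is a minimal Java sketch of folding with an ignore list. It is not the plugin's code. Instead of the ASCIIFoldingFilter mapping table it uses java.text.Normalizer decomposition and drops combining marks, so letters that have no decomposition (such as ø or æ) pass through unchanged. The class name FoldingSketch and the method fold are hypothetical names used only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the plugin's actual filter.
final class FoldingSketch {

    // Strips combining marks from every character except those listed in
    // ignoreChars, which are copied to the output untouched.
    static String fold(String input, String ignoreChars) {
        // Compose first so that "o" + combining diaeresis is seen as one
        // precomposed letter and can be matched against ignoreChars.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); ) {
            int cp = composed.codePointAt(i);
            i += Character.charCount(cp);
            if (ignoreChars.indexOf(cp) >= 0) {
                out.appendCodePoint(cp); // protected letter, keep as is
                continue;
            }
            // Decompose this one character and keep everything that is
            // not a non-spacing (combining) mark.
            String decomposed = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            decomposed.codePoints()
                    .filter(c -> Character.getType(c) != Character.NON_SPACING_MARK)
                    .forEach(out::appendCodePoint);
        }
        return out.toString();
    }
}
```

Under these assumptions, fold("idé", "åäö") gives "ide" and fold("Göteborg", "åäö") leaves the word unchanged, while an empty ignore list gives "Goteborg". Note that the ignore list is case sensitive, so with only lower case letters listed it relies on a lower case filter having run first.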
I also created another filter that replaces characters. I have
several cases with library catalogues where no distinction should be
made between, for instance, 'v' and 'w'. I.e., if you search for
värmland you should get a hit for both Wärmland and Värmland.
The filter will normalize the following (for a setting w=v):
värmland -> värmland
wärmland -> värmland
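
The replacement step can be sketched in plain Java as follows. This is only an illustration, not the plugin's code: the class name ReplaceCharsSketch and both method names are hypothetical, and it assumes that a flat pair list such as w,v, j,i is read as "replace w with v, replace j with i".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the plugin's actual filter.
final class ReplaceCharsSketch {

    // Builds a lookup table from a flat list: from1, to1, from2, to2, ...
    // (assumed reading of the pair list).
    static Map<Character, Character> parsePairs(char... flat) {
        if (flat.length % 2 != 0) {
            throw new IllegalArgumentException("pairs must come in twos");
        }
        Map<Character, Character> pairs = new HashMap<>();
        for (int i = 0; i < flat.length; i += 2) {
            pairs.put(flat[i], flat[i + 1]);
        }
        return pairs;
    }

    // Replaces every character that has an entry in the table and copies
    // all other characters unchanged.
    static String replace(String token, Map<Character, Character> pairs) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            char mapped = pairs.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

With parsePairs('w', 'v', 'j', 'i'), both "wärmland" and "värmland" come out as "värmland". The sketch only matches the exact characters given, so "Wärmland" needs to be lower cased before it reaches this step.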
The way to use both of these filters is:
index :
    analysis :
        analyzer :
            default :
                type: custom
                tokenizer: standard
                filter: [lowercase, asciiFolding, replaceChars]
        filter :
            asciiFolding:
                type: se.devo.esfilter.ConfigurableASCIIFoldingTokenFilterFactory
                ignore_chars : åäö
            replaceChars:
                type: se.devo.esfilter.ReplaceCharTokenFilterFactory
                char_pairs : [w,v, j,i]
Is this of enough general interest that it should be contributed?
Or is it better that I keep it as a private plugin?
-- Kristian Jörg