I would very much appreciate advice from the community on best practices for handling umlauts in search.
My current setup, which indexes a mix of German and English, uses an asciifolding token filter with the original token preserved, and this covers about 90% of use cases.
In effect, for each token containing an umlaut it emits an additional token with the umlaut replaced by a single character. However, to cover the remaining 10% of cases I would also like to match words written with expanded umlauts. So for "Köln" I would like all of the following to yield a match:
köln
koln
koeln
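For reference, the current analyzer chain looks roughly like this (index, filter, and analyzer names are just placeholders):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ascii_folding_original": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding_original"]
        }
      }
    }
  }
}
```

With this analyzer "Köln" is indexed as both "köln" and "koln", but never as "koeln".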
I've tried to add the missing third variant by using the german_normalization filter. It works as intended, but because it performs a simple substitution it also mangles words like "Raphael" into "raphal", which is something I don't want.
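What I tried was roughly this fragment of the analysis settings (the analyzer name is a placeholder):

```
"analyzer": {
  "german_norm_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "german_normalization"]
  }
}
```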
It appears that a good solution would be a normalization filter that also preserves the original token. However, I can't find a way to create such a filter chain.
Unfortunately, there is no "one size fits all" solution.
Mixing German and English words in the same index is generally not a good idea, but it should work for umlauts, because they rarely appear in English words.
I use two alternatives: one with stemming using the "German2" snowball stemmer, the other without stemming. See
and
The stemming variant may not be acceptable, because many German words collapse into the same word form in the index (known as "overstemming"). I try to soften this effect by also indexing the original word form, with the help of the keyword repeat filter.
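A minimal sketch of that stemming variant, assuming the standard snowball, keyword_repeat, and remove_duplicates token filters (remove_duplicates only drops tokens the stemmer left unchanged, so the original form is kept exactly once):

```
"filter": {
  "german2_stemmer": {
    "type": "snowball",
    "language": "German2"
  }
},
"analyzer": {
  "german_stemmed": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "keyword_repeat",
      "german2_stemmer",
      "remove_duplicates"
    ]
  }
}
```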
Because I keep the original form anyway, I do not protect words like "Raphael" from stemming, but it should be possible with the keyword marker token filter, see
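A sketch of what that could look like, with a hypothetical list of protected words placed in front of the stemmer (the tokens are already lowercased at that point, hence the lowercase keyword entry):

```
"filter": {
  "protected_words": {
    "type": "keyword_marker",
    "keywords": ["raphael"]
  },
  "german2_stemmer": {
    "type": "snowball",
    "language": "German2"
  }
},
"analyzer": {
  "german_stemmed_protected": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "protected_words", "german2_stemmer"]
  }
}
```

Tokens marked as keywords pass through the stemmer untouched.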