Ways to handle umlauts

I would very much appreciate advice from the community on best practices for handling umlauts in search.
What I have right now, in a setup that mixes German and English, is an asciifolding token filter with preserve_original enabled, which covers 90% of my use cases.
In effect, for each token containing an umlaut it emits an additional token with the umlaut replaced by a single character. However, to cover the remaining 10% of cases I would also like to match words written with expanded umlauts. So for "Köln" I would like all of the following to yield a match (my current analyzer is sketched after the list):

  • köln
  • koln
  • koeln
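
For reference, my current analyzer looks roughly like this (the index, filter, and analyzer names are just placeholders):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "umlaut_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "folding_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "umlaut_folding"]
        }
      }
    }
  }
}
```

This indexes both "köln" and "koln" for "Köln", but never "koeln".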

I've tried to add the missing third variant by using the german_normalization filter. It works as intended, but because it performs simple substitution it also mangles words like "Raphael" into "raphal", which is something I don't want.
It seems a good solution would be a normalization filter that also preserves the original token, but I can't find a way to build such a filter chain.
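
The mangling is easy to reproduce with the _analyze API (a minimal repro, not my full config):

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "german_normalization"],
  "text": "Köln koeln Raphael"
}
```

Here "köln" and "koeln" both become "koln", which is exactly what I want, but "raphael" comes out as "raphal".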

Unfortunately, there is no "one size fits all" solution.

Mixing German and English words in the same index is generally not a good idea, but it should work for umlauts, because they rarely appear in English words.

I use two alternatives, one with stemming using the "German2" snowball stemmer, the other without stemming; both are sketched below.
The stemming variant may not be acceptable, because many German words collapse into the same word form in the index (known as "overstemming"). I try to soften this effect by also indexing the original word form, with the help of the keyword repeat filter.
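
Roughly, the two variants look like this (names are placeholders, and the unstemmed variant is reduced to plain lowercasing for brevity). The keyword_repeat filter emits each token a second time with the keyword attribute set, the snowball stemmer leaves the keyword-marked copy untouched, and remove_duplicates drops the pairs where stem and original turn out identical:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german2_stemmer": {
          "type": "snowball",
          "language": "German2"
        }
      },
      "analyzer": {
        "german_stemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_repeat", "german2_stemmer", "remove_duplicates"]
        },
        "german_unstemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```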

Because I keep the original form, I do not protect words like "Raphael" from stemming, but it should be possible with the keyword marker token filter; see

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
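
A minimal sketch, assuming the protected words can be listed inline (there is also a keywords_path option for loading them from a file):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "protected_names": {
          "type": "keyword_marker",
          "keywords": ["raphael"]
        },
        "german2_stemmer": {
          "type": "snowball",
          "language": "German2"
        }
      },
      "analyzer": {
        "german_protected": {
          "tokenizer": "standard",
          "filter": ["lowercase", "protected_names", "german2_stemmer"]
        }
      }
    }
  }
}
```

The marker has to run before the stemmer, and since it matches case-sensitively by default, the keyword is listed in lowercase and the filter placed after lowercase (alternatively, set "ignore_case": true).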
