Searching for umlauts

Hello, I use ES in v5.6.4 and have the following setup:

DELETE telephone_book

PUT telephone_book
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "tokenizer": {
        "ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "german": {
          "tokenizer": "ngram",
          "filter": [
            "lowercase",
            "german_normalization",
            "german_stemmer"
          ]
        },
        "ngram": {
          "tokenizer": "ngram"
        }
      }
    }
  },
  "mappings": {
    "entry": {
      "dynamic_templates": [
        {
          "general_data": {
            "path_match": "variations.generalData.*",
            "mapping": {
              "type": "text",
              "analyzer": "german",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        {
          "addresses": {
            "path_match": "variations.addresses.*",
            "mapping": {
              "type": "text",
              "analyzer": "ngram",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ],
      "properties": {
        "variations": {
          "type": "nested"
        }
      }
    }
  }
}

POST telephone_book/entry
{
  "variations": [
    {
      "generalData": {
        "name": "Müller"
      },
      "addresses": {
        "phone": "123456789"
      }
    }
  ]
}

POST telephone_book/entry
{
  "variations": [
    {
      "generalData": {
        "name": "Mueller",
        "test": "Weiß"
      }
    }
  ]
}

POST telephone_book/entry
{
  "variations": [
    {
      "generalData": {
        "name": "Muller"
      }
    }
  ]
}

GET /telephone_book/_search
{
  "query": {
    "nested": {
      "path": "variations",
      "query": {
        "query_string": {
          "fields": ["variations.generalData.*", "variations.addresses.*"],
          "query": "Muell"
        }
      }
    }
  }
}

Since I use a German-language analyzer, I would expect to find "Müller" when searching for "Muell", but the only response I get is the "Mueller" entry.

Here is a simpler example:

DELETE telephone_book
PUT telephone_book
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "tokenizer": {
        "ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "german": {
          "tokenizer": "ngram",
          "filter": [
            "lowercase",
            "german_normalization",
            "german_stemmer"
          ]
        },
        "ngram": {
          "tokenizer": "ngram"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "german"
        }
      }
    }
  }
}
POST telephone_book/_doc
{
  "name": "Müller"
}
POST telephone_book/_doc
{
  "name": "Mueller"
}
POST telephone_book/_doc
{
  "name": "Muller"
}
GET /telephone_book/_search
{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "Muell"
    }
  }
}

This produces the same result. As a side note, please try to keep examples as simple as possible; that helps you get a faster answer.

Have a look at this:

POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Muller"]
}
POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Müller"]
}
POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Mueller"]
}
POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Muell"]
}

If you run this, you will see exactly which tokens are generated at index time and at search time. That will help you understand why only "Mueller" matches when searching for "Muell".

Using "explain": true will show you every transformation that happens inside the analyzer:

POST telephone_book/_analyze
{
  "explain": true, 
  "analyzer": "german",
  "text": ["Muller"]
}

HTH

Thanks for the tip with the _analyze endpoint, that is really interesting.

Sorry for my "complicated" example, but I just wanted to show what my setup looks like (with a nested path), since I didn't know whether that might be the problem.

So, back to my problem: I have searched a lot on this topic and after a week still can't figure out how to set up my index so that searching for "ue" matches an "ü". Some posts mention a Snowball plugin, I tried all the different stemmers (german, german2, light_german, minimal_german), and other posts are so old that I just wanted confirmation that this is still the right way to do it.

If the answer is: you just CAN'T get "ü" results when searching for "ue", then I (and especially my customers) will have to accept that. But I wanted to see whether there is some way to achieve this.

If "ue" and "ü" should be "synonyms", maybe you could use a synonym token filter in your custom analyzer?
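A minimal sketch of that idea (the index and filter names here are made up). One caveat worth knowing: a synonym token filter operates on whole tokens, so single-character entries like "ü, ue" would not rewrite a token such as "müller"; you would need whole-word synonyms:

```
PUT telephone_book_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "name_synonyms": {
          "type": "synonym",
          "synonyms": [
            "müller, mueller"
          ]
        }
      },
      "analyzer": {
        "german_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "name_synonyms"
          ]
        }
      }
    }
  }
}
```

Maintaining such a list per name obviously doesn't scale for a telephone book, so this is more of a fallback than a full solution.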

I'm surprised, though, that this is needed and that no other German speaker has hit this before. Sadly, my German isn't good enough to comment beyond that :slight_smile:

Pinging @xeraa in case he has better ideas.

Yeah, it is a bug that popped up at my workplace. A coworker said that he rarely sees a working umlaut search in other address books like Outlook etc.
So, as I said, if nothing works (I will try out the synonym token filter), a "no" is also an answer I will accept ^^

@cherry-wave I think what you are looking for is the german_normalization filter, which you already have in your example:

'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
'ae' and 'oe' are replaced by 'a', and 'o', respectively.
'ue' is replaced by 'u', when not following a vowel or q.

Let's put this together the right way. You want to tokenize, normalize (optionally stem, but that makes no difference for this example, so I've left it out), and only build the ngrams at the very end. You were basically cutting into ngrams too early, so the normalization couldn't work:

DELETE telephone_book
PUT telephone_book
{
  "settings": {
    "analysis": {
      "analyzer": {
        "german": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_normalization",
            "ngram_filter"
          ]
        }
      },
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "german"
        }
      }
    }
  }
}
POST telephone_book/_doc
{
  "name": "Müller"
}
POST telephone_book/_doc
{
  "name": "Mueller"
}
POST telephone_book/_doc
{
  "name": "Muller"
}
GET /telephone_book/_search
{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "Muell"
    }
  }
}

This should work and you can also check the ngrams with the _analyze endpoint.
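For example (the token lists below are my reading of this analysis chain, so please verify against your own cluster):

```
POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Müller"]
}

POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Mueller"]
}
```

Both should normalize to "muller" before the ngram filter runs and therefore produce the same trigrams (mul, ull, lle, ler), while the query "Muell" normalizes to "mull", whose trigrams (mul, ull) are contained in that set.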

If you need to customize those rules (for example, 'ue' should always be replaced by 'u'), you'll need to write your own char filter, probably a mapping char filter.
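A minimal sketch of such a mapping char filter (the names are made up; extend the mappings as needed). Note that char filters run before the tokenizer and the lowercase filter and are case-sensitive, so you may also need uppercase variants like "Ü => U":

```
PUT telephone_book_mapped
{
  "settings": {
    "analysis": {
      "char_filter": {
        "umlaut_mapping": {
          "type": "mapping",
          "mappings": [
            "ü => u",
            "ue => u",
            "ö => o",
            "oe => o",
            "ä => a",
            "ae => a",
            "ß => ss"
          ]
        }
      },
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "german_mapped": {
          "char_filter": ["umlaut_mapping"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  }
}
```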


Thanks, that works like a charm! Using an ngram filter instead of an ngram tokenizer, wow!

Thanks to _analyze and "explain": true I now understand much better what is happening, and I also found out that if I use a stemmer, I should add it AFTER the ngram filter, otherwise my "ends with" search won't work.

Great!

I'm not sure stemming and ngrams make much sense together. I'd analyze a field multiple times and then search across all of them to get the best results.
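One way to sketch that multiple-analysis idea is with multi-fields (the sub-field name and the second analyzer "german_ngram" are assumptions; both analyzers would need to be defined in the index settings):

```
PUT telephone_book
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "german",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "german_ngram"
            }
          }
        }
      }
    }
  }
}

GET /telephone_book/_search
{
  "query": {
    "multi_match": {
      "query": "Muell",
      "fields": ["name", "name.ngram"]
    }
  }
}
```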

PS: Something you might want to look into for German is decompounding, unless that is already sufficiently covered by the ngrams.
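For reference, decompounding is available as a token filter; here is a toy sketch (a real setup needs a proper German word list, or hyphenation patterns for the hyphenation_decompounder variant):

```
"filter": {
  "german_decompounder": {
    "type": "dictionary_decompounder",
    "word_list": ["telefon", "buch"]
  }
}
```

With this, a token like "telefonbuch" would be emitted together with the subwords "telefon" and "buch", so a search for "buch" would also match.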


I only had the stemmer in my analyzer because it was part of the default posted here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

I also don't think I will need a stemmer in a telephone book.

Regarding your PS: I can't think of an example where this would cause problems. I also think the ngram filter already covers that.
