Is umlaut expansion such as ü -> [ü, u, ue] possible with built-in Elasticsearch tokenizers/filters?

Hi there elasticers!

My index is a list of documents representing the names of universities from all over the world in their "English" form.
Our German users are having problems finding universities in Germany.
Their instinct is to search using umlauts, for example "Münster".
But we have it (and others) listed in the "English" form, e.g. "Muenster".

My first instinct was to try an asciifolding filter.
This made the situation worse: it increased recall by a lot, at the expense of precision!
Typing in "Mü" was transformed to "Mu", which returned a huge list of universities (from other countries), with Muenster very far down.
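
To show what I mean, here is a small Python sketch that roughly imitates the folding outside of Elasticsearch. The `fold` helper and the sample names are made up for illustration; it only approximates what asciifolding does and is not the actual filter.

```python
import unicodedata

def fold(text):
    """Approximate ASCII folding: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

# Illustrative sample only, not my real index.
names = ["Muenster", "Mumbai", "Murdoch", "Munich"]

query = fold("Mü")  # the umlaut is lost here, leaving "mu"
hits = [n for n in names if fold(n).startswith(query)]
print(query, hits)
```

Once the umlaut is folded away, the prefix "mu" matches every name in the sample, so Muenster is no longer special.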

What I then thought would be nice was to be able, via an analyzer, to create more tokens for indexed docs.
So indexing Muenster would produce -> [Münster, Munster, Muenster],
and then at search time typing in Mü would match (match_phrase_prefix) one of Muenster's tokens and return it.
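
As a rough sketch of the expansion I have in mind, written in plain Python rather than as an analyzer: `expand_variants` and `prefix_matches` are hypothetical names of my own, not Elasticsearch APIs, and the digraph table is an assumption about which spellings should be treated as equivalent.

```python
import itertools

# Assumed equivalences: each digraph may also stand for an umlaut or a bare vowel.
DIGRAPHS = {
    "ae": ("ae", "ä", "a"),
    "oe": ("oe", "ö", "o"),
    "ue": ("ue", "ü", "u"),
}

def expand_variants(name):
    """Return every spelling obtained by rewriting each ae/oe/ue digraph."""
    name = name.lower()
    parts = []  # one tuple of alternatives per position
    i = 0
    while i < len(name):
        pair = name[i:i + 2]
        if pair in DIGRAPHS:
            parts.append(DIGRAPHS[pair])
            i += 2
        else:
            parts.append((name[i],))
            i += 1
    return {"".join(combo) for combo in itertools.product(*parts)}

def prefix_matches(query, name):
    """True if the lowercased query is a prefix of any variant of name."""
    q = query.lower()
    return any(variant.startswith(q) for variant in expand_variants(name))

print(sorted(expand_variants("Muenster")))
print(prefix_matches("Mü", "Muenster"), prefix_matches("Mü", "Mumbai"))
```

Here "Mü" matches Muenster through its "münster" variant but does not match Mumbai, because the query itself is never folded. One known weakness of this naive version is that it also expands digraphs that are not umlauts at all (for example the "ue" in "Queen").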

I tried a lot of approaches that I came up with myself and that I found in topics on the Elastic boards, such as
snowball + German2 filter, icu_folding, german_normalization, char_filter, and combinations of them.
These all suffered from the same problem, namely increasing recall by reducing any mü/mue combination to mu.

Am I missing some other approach?
Or do I need to develop a custom solution for this from scratch?

I'm using Elasticsearch 5.6.
Elasticsearch 6.x solutions are also welcome, since upgrading is possible on my end.
