Analyze German words with umlauts


(Volodymyr Usarskyy) #1

Hello everyone!

I have a German word with an umlaut, let's say "läuft". My goal is to create an analyzer that produces three tokens: "läuft", "laeuft" and "lauft".

I have tried different combinations of the icu_normalizer, asciifolding and snowball (German2) filters, but without success. The best result came from the asciifolding token filter, which emits two of the three required tokens: "läuft" and "lauft".

So, basically, I need some kind of custom asciifolding filter for German that also emits the additional variants for words with umlauts.

My configuration for the asciifolding and snowball filters is the following:

"ascii2": {
  "type": "asciifolding",
  "preserve_original": "true"
},

"snow-german2": {
  "type": "snowball",
  "language": "German2"
},

I would really appreciate your help!


#2

Hi,
You should try the Combo analyzer plugin: https://github.com/yakaz/elasticsearch-analysis-combo/
It can combine multiple analyzers: for example, the one you mentioned (läuft => läuft, lauft) and another one that produces läuft => laeuft, using a character mapping or a regexp-based pattern replace.
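
To make the idea concrete, here is a minimal Python sketch of what the two combined branches would emit. It is not an Elasticsearch configuration and does not use the plugin. The names GERMAN_EXPANSIONS, expand_umlauts, fold_to_ascii and token_variants are made up for illustration, and fold_to_ascii is only a rough stand-in for the asciifolding filter.

```python
import unicodedata

# German transliteration of umlauts and eszett (the "laeuft" style).
# Hypothetical table for illustration; extend as needed.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def expand_umlauts(token):
    """Branch 1: replace each umlaut with its two-letter form (ä -> ae)."""
    return "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in token)


def fold_to_ascii(token):
    """Branch 2: rough stand-in for asciifolding (ä -> a).

    Decomposes characters and drops the combining marks; this is an
    approximation, not the actual filter implementation.
    """
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("ß", "ss")


def token_variants(token):
    """Return the original token plus both variants, without duplicates."""
    variants = []
    for candidate in (token, expand_umlauts(token), fold_to_ascii(token)):
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(token_variants("läuft"))
```

For "läuft" this yields the three tokens asked for in the first post, while a word without umlauts yields only itself because duplicates are dropped.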


(Volodymyr Usarskyy) #3

Yes, we came to the same conclusion on SO (http://stackoverflow.com/questions/32114129/elasticsearch-analyzer-for-german-language). It seems to be the only possible solution for now.

Thank you for your advice!

