Stemming with german2 on hyphenated compound words

Hello,

I have a problem using the german2 stemmer on hyphenated compound words with Elasticsearch.

As an example, I have two words: "Export-Schnittstelle" and "Schnittstelle". For these words the stemmer produces "Export-Schnittstell" and "Schnittstell" respectively, which is great: with the right tokenization I can now search for "Schnittstelle" (which the stemmer within my search analyzer will transform to "Schnittstell") and it will match the second part of the word "Export-Schnittstelle", i.e. "Export-Schnittstell".

Now I would expect this to work the same way for all hyphenated compound words, but unfortunately that's not the case. I have two other words, "PA-Schiene" and "Schiene", and here the stemmer produces two completely different stems: "PA-Schi" and "Schien".

Can someone explain to me why this is, and whether there is a way to fix it? Maybe by using a different stemmer, like light_german or minimal_german?

Thanks in advance.

Best Regards
Simon

It seems that using "minimal_german" for stemming solves this problem without having to split the word on hyphens, which I don't want to do.
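For reference, the same kind of _analyze request as below can be used to check this, swapping in "minimal_german" as the stemmer language (a supported value for the stemmer token filter). The expected stems for my problem words are an assumption on my part; run the request to confirm them:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "minimal_german"
    }
  ],
  "text": [
    "PA-Schiene",
    "Schiene"
  ]
}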

Hi @simon137

I don't know what your analyzer looks like, but the request below shows the tokens that get generated. If you want to keep the hyphenated word as a single token instead of breaking it apart, you can switch to the whitespace tokenizer and verify that the tokens are still correct.

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "german"
    }
  ],
  "text": [
    "Export-Schnittstelle",
    "Schnittstelle",
    "PA-Schiene",
    "Schiene"
  ]
}