Stemming with german2 on hyphenated compound words

Hello,

I have a problem using the german2 stemmer on hyphenated compound words with Elasticsearch.

As an example, I have two words: "Export-Schnittstelle" and "Schnittstelle". For these words the stemmer produces "Export-Schnittstell" and "Schnittstell" respectively, which is great: with the right tokenization I can now search for "Schnittstelle" (which the stemmer within my search analyzer will transform to "Schnittstell") and it will match the second part of the word "Export-Schnittstelle", i.e. "Export-Schnittstell".

Now I would expect this to work the same way for all hyphenated compound words, but unfortunately that's not the case. I have two other words, "PA-Schiene" and "Schiene", and here the stemmer produces two completely different stems: "PA-Schi" and "Schien".

Can someone explain to me why this is, and whether there is a way to fix it? Maybe by using a different stemmer, like light_german or minimal_german?

Thanks in advance.

Best Regards
Simon

It seems that using "minimal_german" for stemming solves this problem without having to split the word on hyphens, which I don't want to do.
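For reference, the same kind of _analyze request as below can be used to check this, swapping in "minimal_german" as the stemmer language (a supported value for the stemmer token filter). The expected stems for my problem words are an assumption on my part; run the request to confirm them:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "minimal_german"
    }
  ],
  "text": [
    "PA-Schiene",
    "Schiene"
  ]
}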

Hi @simon137

I don't know what your analyzer looks like, but the request below shows the tokens that get generated. If you want to keep the hyphenated word as a single token instead of breaking it apart, you can switch to the whitespace tokenizer and verify that the tokens are still correct.

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "german"
    }
  ],
  "text": [
    "Export-Schnittstelle",
    "Schnittstelle",
    "PA-Schiene",
    "Schiene"
  ]
}