German stemmer - Cistem and Caumanns stemmer

htsr · December 1, 2019, 5:22am

The stemming algorithms available for German in Elasticsearch do not perform well, but there are alternatives. Apache Lucene/Solr already offers the GermanStemmer, which is based on the Caumanns stemming algorithm. There is also the newer Cistem stemmer, which is already available for NLTK. The authors of the Cistem algorithm compared several stemming algorithms, and it seems that the Caumanns and Cistem stemmers are the best ones for German (see the study results). Are there any plans to incorporate these two stemming algorithms into Elasticsearch, or will writing a custom plugin be the only way to use one of these German stemmers?

spinscale · December 2, 2019, 10:02am

The GermanStemmer is available via the german_stem token filter, as that one uses a GermanStemFilter, which in turn uses the GermanStemmer.
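For anyone who wants to try this out, here is a minimal sketch of index settings that wire the german_stem filter into a custom analyzer. The names `german_stemmed` and `de_stem` are made up for illustration, and the tokenizer and lowercase step are assumptions rather than a recommended chain. The script only builds and prints the JSON body; sending it to a cluster is left out.

```python
import json


def german_stem_settings(analyzer_name="german_stemmed", filter_name="de_stem"):
    """Build an index-settings body whose analyzer ends in the german_stem filter.

    analyzer_name and filter_name are hypothetical labels chosen for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Custom filter entry that points at the german_stem type.
                    filter_name: {"type": "german_stem"},
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Assumed chain: lowercase first, then stem.
                        "filter": ["lowercase", filter_name],
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be used when creating an index.
    print(json.dumps(german_stem_settings(), indent=2))
```

Once an index exists with these settings, the `_analyze` API can be pointed at the analyzer to check what stems it produces for a given word.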

For the Cistem stemmer you would need to write your own plugin...

htsr · December 3, 2019, 3:39pm

Thanks for the reply.

I couldn't find any information about the GermanStemmer/GermanStemFilter in the Elasticsearch documentation.
(Is there any reference to this stem filter in the official documentation?)

I tried the Cistem stemmer via a custom plugin, too.
For example, the words Wunsch, Wünsche, gewünscht and wünschen were all reduced to the same stem (wunsch). The Caumanns stemmer didn't do that.
But both stemmers seem to have their limitations. For example, Freund and Freunde were stemmed to a different stem than Freundin and Freundinnen.

spinscale · December 4, 2019, 12:55pm

Stemmers for the German language are tricky in general. If those results are not good enough, you could always resort to a dictionary-based stemmer like the hunspell one.
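A rough sketch of what the hunspell variant could look like in index settings. It assumes a German dictionary (the `.aff` and `.dic` files) has been installed under the node's `hunspell/de_DE` config directory. The names `de_hunspell` and `german_hunspell` are made up, and placing the hunspell filter before lowercasing is an assumption based on German dictionaries listing nouns capitalized.

```python
import json


def hunspell_settings(locale="de_DE"):
    """Build an index-settings body using a hunspell token filter.

    The filter and analyzer names are hypothetical; the locale must match an
    installed dictionary directory.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "de_hunspell": {
                        "type": "hunspell",
                        "locale": locale,
                        # Drop duplicate stems emitted for the same token.
                        "dedup": True,
                    },
                },
                "analyzer": {
                    "german_hunspell": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Assumption: look words up with their original case,
                        # then lowercase the resulting stems.
                        "filter": ["de_hunspell", "lowercase"],
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(hunspell_settings(), indent=2))
```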

We should update the documentation to cover the german_stem filter. I'm happy to do that, unless you would be willing to submit a pull request.

--Alex

htsr · December 5, 2019, 5:09pm

Hi, Alex,
once again thanks for the information.

Another option could be to mix stemming with synonyms for words that are known not to be stemmed appropriately. But that would be a somewhat static way to deal with this problem.
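To illustrate the idea, here is a minimal sketch that puts a synonym filter in front of the stemmer, so that hand-picked word forms are mapped to a common term before stemming runs. The rule shown is a made-up example based on the Freund/Freundin case above, and the names `known_misses`, `de_stem` and `german_stem_with_synonyms` are hypothetical. Whether this gives acceptable results would need testing.

```python
import json


def synonym_then_stem_settings(rules):
    """Build index settings that apply synonym rules before german_stem.

    rules is a list of Solr-style synonym lines, e.g. "a, b => c".
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Static list of forms the stemmer is known to miss.
                    "known_misses": {"type": "synonym", "synonyms": list(rules)},
                    "de_stem": {"type": "german_stem"},
                },
                "analyzer": {
                    "german_stem_with_synonyms": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Order matters: rewrite known forms first, then stem.
                        "filter": ["lowercase", "known_misses", "de_stem"],
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    # Example rule only; the list would have to be maintained by hand.
    example_rules = ["freundin, freundinnen => freund"]
    print(json.dumps(synonym_then_stem_settings(example_rules), indent=2))
```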

Could you please update the documentation?

cbuescher · December 6, 2019, 8:54am

While this is a possible workaround, please note that there might be caveats in ranking synonyms slightly differently than stemmed tokens. For example, we are currently exploring ways to slightly deboost matches on synonym tokens in search. The same might be useful for stemmed vs. non-stemmed tokens, but those two things are semantically different and might be treated differently in scoring in the long run.

cbuescher · December 6, 2019, 8:59am

Btw, this sounds very interesting. Which plugin did you use? Is it publicly available? It looks like the library is MIT licensed at least, which is great.

spinscale · December 6, 2019, 12:04pm

See https://github.com/LeonieWeissweiler/CISTEM/blob/master/Cistem.java (not the plugin, just the Java class). I hope that's what you meant?

cbuescher · December 6, 2019, 1:36pm

Ah, okay. Sounded like there's a ready-to-install plugin already, but apparently not.

htsr · December 6, 2019, 2:31pm

I wrote my own custom plugin using the stempel and ICU plugins as templates.

cbuescher · December 6, 2019, 4:17pm

That sounds very interesting. I would be interested to try it if you are willing and able to share it, even if it's in a WIP state, as long as I can build it from source. No problem if that's not possible, though.

htsr · December 6, 2019, 5:50pm

If you are interested in testing the plugin, I will try my best. But it will be next week before I can get back to you.

