The stemming algorithms available for German in Elasticsearch do not perform well, but there are alternatives. Apache Lucene/Solr already offers the GermanStemmer, which is based on the Caumanns stemming algorithm. There is also the newer Cistem stemmer, which is already available in NLTK. The authors of the Cistem algorithm compared the different stemming algorithms, and it seems the Caumanns and Cistem stemmers are the best stemming algorithms for German (see the study results). Are there any plans to incorporate these two stemming algorithms into Elasticsearch, or will the only option be to write a custom plugin in order to use one of these German stemmers?
The GermanStemmer is available via the german_stem token filter, as that one uses a GermanStemFilter, which in turn uses the GermanStemmer.
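For anyone wanting to try this, here is a minimal sketch of index settings that wire the german_stem token filter into a custom analyzer. The names caumanns_stem and german_caumanns are made up for illustration, and the tokenizer and lowercase filter are assumptions about a reasonable chain rather than anything prescribed above. The code only builds the settings body and prints it as JSON; sending it to a cluster is left out.

```python
import json


def german_stem_index_settings() -> dict:
    """Build an index settings body that uses the german_stem token filter.

    "caumanns_stem" and "german_caumanns" are hypothetical names chosen
    for this sketch; only the filter type "german_stem" comes from the thread.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Filter backed by Lucene's GermanStemFilter / GermanStemmer.
                    "caumanns_stem": {"type": "german_stem"}
                },
                "analyzer": {
                    "german_caumanns": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first (an assumption), then stem.
                        "filter": ["lowercase", "caumanns_stem"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be used when creating an index.
    print(json.dumps(german_stem_index_settings(), indent=2))
```

The printed body could be used when creating a test index, and the analyzer could then be checked with the _analyze API on words like Wunsch and Wünsche.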
For the Cistem stemmer you would need to write your own plugin...
Thanks for the reply.
I couldn't find any information about the GermanStemmer/GermanStemFilter in the Elasticsearch documentation.
(Is there any reference to this stem filter in the official documentation?)
I tried the Cistem stemmer via a custom plugin, too.
For example, the words Wunsch, Wünsche, gewünscht and wünschen were all reduced to the same stem (wunsch). The Caumanns stemmer didn't do that.
But both stemmers seem to have their limitations. For example, Freund and Freunde were stemmed to a different stem than Freundin and Freundinnen.
Stemmers for German are tricky in general. If those results are not good enough, you could always resort to a dictionary-based stemmer like the hunspell one.
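As a rough sketch of the dictionary-based route, the settings below define a hunspell token filter for German. The names german_hunspell and german_dictionary are invented for this example, and the locale de_DE assumes that matching dictionary files (de_DE.aff and de_DE.dic) have been placed under config/hunspell/de_DE on every node; nothing in this thread confirms which dictionary would be used.

```python
import json


def german_hunspell_index_settings(locale: str = "de_DE") -> dict:
    """Build an index settings body that uses the hunspell token filter.

    "german_hunspell" and "german_dictionary" are hypothetical names.
    The locale is an assumption and must match a dictionary directory
    installed on the Elasticsearch nodes.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "german_hunspell": {
                        "type": "hunspell",
                        "locale": locale,
                        # Drop duplicate stems produced for the same token.
                        "dedup": True,
                    }
                },
                "analyzer": {
                    "german_dictionary": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "german_hunspell"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(german_hunspell_index_settings(), indent=2))
```

The quality of the results depends entirely on the dictionary, so words missing from it are simply left unstemmed.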
We should update the documentation to cover the german_stem token filter. I'm happy to do that, unless you would be willing to submit a pull request.
--Alex
Hi, Alex,
once again thanks for the information.
Another option could be to mix stemming with synonyms for words known not to be stemmed appropriately. But that would be a somewhat static way to deal with this problem.
Could you please update the documentation?
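To make the idea concrete, here is a small, library-free sketch of combining a hand-maintained exception list with a stemmer: known problem words are mapped to a fixed stem, and everything else falls through to the regular stemmer. The OVERRIDES entries and toy_stem are purely hypothetical; toy_stem is a placeholder and is neither the Caumanns nor the Cistem algorithm.

```python
from typing import Callable, Dict

# Hypothetical, hand-maintained exceptions: surface form -> desired stem.
OVERRIDES: Dict[str, str] = {
    "freundin": "freund",
    "freundinnen": "freund",
}


def toy_stem(token: str) -> str:
    """Placeholder stemmer (NOT Caumanns or Cistem): strips one trailing 'e'."""
    if len(token) > 3 and token.endswith("e"):
        return token[:-1]
    return token


def stem_with_overrides(
    token: str,
    stem: Callable[[str], str] = toy_stem,
    overrides: Dict[str, str] = OVERRIDES,
) -> str:
    """Return the override stem if one exists, else fall back to the stemmer."""
    lowered = token.lower()
    if lowered in overrides:
        return overrides[lowered]
    return stem(lowered)


if __name__ == "__main__":
    for word in ["Freund", "Freunde", "Freundin", "Freundinnen"]:
        print(word, "->", stem_with_overrides(word))
```

Inside Elasticsearch, the stemmer_override token filter follows the same pattern: it applies a custom mapping and protects those terms from later stemming filters, which is a slightly different mechanism than synonyms.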
While this is a possible workaround, please note that there might be caveats in ranking synonyms slightly differently than stemmed tokens. For example, we currently explore ways to slightly deboost matches on synonym tokens in search. The same might be useful for stemmed vs. non-stemmed tokens, but those two things are semantically different and might be treated differently in scoring in the long run.
By the way, this sounds very interesting. Which plugin did you use? Is it publicly available? It looks like the library is MIT licensed at least, which is great.
See https://github.com/LeonieWeissweiler/CISTEM/blob/master/Cistem.java (not the plugin, just the java class), I hope that's what you meant?
Ah, okay. Sounded like there's a ready-to-install plugin already, but apparently not.
I wrote my own custom plugin using the stempel and ICU plugins as templates.
That sounds very interesting. I would be interested in trying it if you are willing and able to share it, even if it's in a WIP state, as long as I can build it from source. No problem if that's not possible, though.
If you are interested in testing the plugin, I will try my best. But it will be next week before I can notify you.