I have started a lexicon-based analyzer for linguistic processing of full
word forms to their base forms (right now, only a German lexicon is provided).
With this plugin, full word forms are reduced to base forms in the
tokenization process. This is also known as lemmatization.
Why is lemmatization better than stemming? With this plugin, you can also
generate additional baseform tokens for irregular word forms. Example:
for the word "zurückgezogen", the base form is "zurückziehen". Algorithmic
stemming would be rather limited for such cases.
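The idea can be sketched as a simple dictionary lookup: for each surface token, the filter looks up a full-form-to-base-form mapping and, on a hit, emits the base form as an additional token. This is a minimal illustration, not the plugin's actual API; the class name, the hard-coded entries, and the plain `HashMap` (where the real plugin uses an FSA over Daniel Naber's lexicon) are all assumptions for demonstration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of dictionary-based lemmatization. A full-form -> base-form
// map stands in for the FSA-backed German lexicon used by the plugin.
public class BaseformSketch {
    private final Map<String, String> lexicon = new HashMap<>();

    public BaseformSketch() {
        // Hypothetical entries; the real plugin loads a full lexicon file.
        lexicon.put("zurückgezogen", "zurückziehen");
        lexicon.put("häuser", "haus");
    }

    // Emit the surface token plus, if the lexicon knows it, a baseform token.
    public List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        out.add(word);
        String base = lexicon.get(word.toLowerCase());
        if (base != null && !base.equals(word)) {
            out.add(base);
        }
        return out;
    }

    public static void main(String[] args) {
        BaseformSketch b = new BaseformSketch();
        System.out.println(b.tokens("zurückgezogen")); // [zurückgezogen, zurückziehen]
    }
}
```

Note that an algorithmic stemmer has no way to derive "zurückziehen" from "zurückgezogen", since the irregular participle shares no usable suffix pattern with its infinitive; only a lexicon lookup can bridge that gap.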
Thanks to Dawid Weiss for the FSA and Daniel Naber for the German
fullform/baseform lexicon.
My version is a stripped-down version of Dawid Weiss' morfologik FSA,
combined with a reader for Daniel Naber's German lexicon, used only for
lemmatization. Morfologik itself can do much more (e.g. POS tagging).
It should be possible to create something like morfologik-german,
morfologik-english, morfologik-french, etc., but I have not dug into it yet.
For Elasticsearch, Dariusz Gertych has already implemented a morfologik
plugin for Polish stemming based on Lucene.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.