[Ann] Elasticsearch Analysis Baseform Plugin 1.0.0


(Jörg Prante) #1

Hi,

I have started a lexicon-based analyzer for linguistic processing of full
word forms to their base form (right now, only a German lexicon is provided).

https://github.com/jprante/elasticsearch-analysis-baseform

With this plugin, full word forms are reduced to base forms in the
tokenization process. This is also known as lemmatization.

Why is lemmatization better than stemming? With this plugin, you can
generate additional base form tokens even for irregular word forms. Example:
for the word "zurückgezogen", the base form is "zurückziehen". Algorithmic
stemming is rather limited in such cases, because stripping suffixes cannot
turn "gezogen" into "ziehen".
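
To make the difference concrete, here is a minimal sketch in Python. It is not the plugin's code: the tiny lexicon, the function names and the suffix list are all assumptions made up for illustration. It only shows the idea of keeping the original token and adding the base form found in a lexicon, next to a naive suffix stripper.

```python
# Toy full form -> base form lexicon. The entries are illustrative only;
# a real lexicon holds far more word forms.
LEXICON = {
    "zurückgezogen": "zurückziehen",
    "ging": "gehen",
}


def add_baseforms(tokens, lexicon=LEXICON):
    """Keep every token and append its base form when the lexicon knows one."""
    out = []
    for token in tokens:
        out.append(token)
        base = lexicon.get(token.lower())
        if base is not None and base != token.lower():
            out.append(base)
    return out


def naive_stem(token, suffixes=("en", "e", "t")):
    """Hypothetical suffix stripper, standing in for an algorithmic stemmer."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


if __name__ == "__main__":
    print(add_baseforms(["er", "hat", "sich", "zurückgezogen"]))
    print(naive_stem("zurückgezogen"))
```

The lexicon lookup yields "zurückziehen" as an extra token, while the suffix stripper can only shorten the word and never reaches the infinitive.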

Thanks to Dawid Weiss for the FSA and Daniel Naber for the German
fullform/baseform lexicon.

Cheers,

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Otis Gospodnetić) #2

Fantastiq! :wink:

Would it make sense to contribute the core of this to Lucene, where I'm
sure this sort of thing would thrive?

Thanks,
Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Monday, October 21, 2013 5:21:28 PM UTC-4, Jörg Prante wrote:

https://github.com/jprante/elasticsearch-analysis-baseform


(Jörg Prante) #3

It already is (for Polish): https://issues.apache.org/jira/browse/LUCENE-2341

My version is a stripped-down version of Dawid Weiss' Morfologik FSA,
combined with a reader for Daniel Naber's German lexicon, only for
lemmatization. Morfologik can do much more (POS tagging).
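
As a rough picture of what such a lookup structure does, here is a small Python sketch. It uses a plain trie, which is a simpler stand-in and not the automaton format Morfologik uses; the class and method names are hypothetical.

```python
class TrieNode:
    """One state of the toy automaton: outgoing edges plus an optional result."""

    __slots__ = ("children", "baseform")

    def __init__(self):
        self.children = {}
        self.baseform = None


class BaseformTrie:
    """Maps full word forms to base forms by walking one character at a time."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, fullform, baseform):
        node = self.root
        for ch in fullform:
            node = node.children.setdefault(ch, TrieNode())
        node.baseform = baseform

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None  # no such path, so the word is not in the lexicon
        return node.baseform


if __name__ == "__main__":
    trie = BaseformTrie()
    trie.add("zurückgezogen", "zurückziehen")
    print(trie.lookup("zurückgezogen"))
    print(trie.lookup("zurück"))
```

Word forms that share a prefix share states, which is part of what keeps a large lexicon compact; a real FSA goes further by also merging common suffixes.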

It should be possible to create something like morfologik-german,
morfologik-english, morfologik-french etc., but I have not dug into it yet.

For Elasticsearch, Dariusz Gertych already implemented a Morfologik plugin
for Polish stemming based on Lucene.

Jörg


