I have started a lexicon-based analyzer for linguistic processing of full
word forms to their base forms (right now, only a German lexicon is provided).
With this plugin, full word forms are reduced to base forms in the
tokenization process. This is also known as lemmatization.
Why is lemmatization better than stemming? With this plugin, you can also
generate additional baseform tokens for irregular word forms. Example:
for the word "zurückgezogen", the base form is "zurückziehen". Algorithmic
stemming would be rather limited for such cases.
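The idea can be sketched as a simple dictionary lookup: for each surface token, the filter looks up a full-form-to-base-form mapping and, on a hit, emits the base form as an additional token. This is a minimal illustration, not the plugin's actual API; the class name, the hard-coded entries, and the plain `HashMap` (where the real plugin uses an FSA over Daniel Naber's lexicon) are all assumptions for demonstration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of dictionary-based lemmatization. A full-form -> base-form
// map stands in for the FSA-backed German lexicon used by the plugin.
public class BaseformSketch {
    private final Map<String, String> lexicon = new HashMap<>();

    public BaseformSketch() {
        // Hypothetical entries; the real plugin loads a full lexicon file.
        lexicon.put("zurückgezogen", "zurückziehen");
        lexicon.put("häuser", "haus");
    }

    // Emit the surface token plus, if the lexicon knows it, a baseform token.
    public List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        out.add(word);
        String base = lexicon.get(word.toLowerCase());
        if (base != null && !base.equals(word)) {
            out.add(base);
        }
        return out;
    }

    public static void main(String[] args) {
        BaseformSketch b = new BaseformSketch();
        System.out.println(b.tokens("zurückgezogen")); // [zurückgezogen, zurückziehen]
    }
}
```

Note that an algorithmic stemmer has no way to derive "zurückziehen" from "zurückgezogen", since the irregular participle shares no usable suffix pattern with its infinitive; only a lexicon lookup can bridge that gap.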
Thanks to Dawid Weiss for the FSA and Daniel Naber for the German
fullform/baseform lexicon.
My version is a stripped-down version of Dawid Weiss' morfologik FSA,
combined with a reader for Daniel Naber's German lexicon, used only for
lemmatization. Morfologik itself can do much more (e.g. POS tagging).
It should be possible to create something like morfologik-german,
morfologik-english, morfologik-french, etc., but I have not dug into it yet.
For Elasticsearch, Dariusz Gertych has already implemented a morfologik
plugin for Polish stemming based on Lucene.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.