Hi,
in my spare time this evening, while I'm still wrangling with some NLP plugins (Stanford, UIMA, OpenNLP) and eagerly awaiting Lucene 4, I reworked a Compact Patricia Trie implementation by Chris Biemann into a German word decompounding Elasticsearch analysis plugin.
It can decompound German words like "Rechtsanwaltskanzleien" into "Recht, anwalt, kanzlei" or "Jahresfeier" into "Jahr, feier". The best thing is, you don't need to provide a word list.
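To try it out, a minimal index settings sketch could look like this (assuming the token filter is registered as "decompound" - please check the README for the exact names):

    curl -XPUT 'localhost:9200/test' -d '{
      "settings": {
        "analysis": {
          "filter": {
            "decomp": { "type": "decompound" }
          },
          "analyzer": {
            "german_decompound": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "decomp"]
            }
          }
        }
      }
    }'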
Very cool!
Do you expect this to work for other languages as well? I see Norwegian mentioned in the README, which is exactly what I'm after.
I was debugging memory issues with ES today that turned out to be caused by my (probably way too large) dictionary, so if this works out it's a godsend.
Hi Jari,
just give it a shot. The CPT data provided is derived from the Leipzig Wortschatz, which is German, so I doubt it works flawlessly for Norwegian. I could try to ask Chris Biemann if he knows how to build Norwegian decompounder CPTs.
Best regards,
Jörg
Hi Jörg,
I did some simple tests which appear to work OK, but if it's possible to improve it I'd be interested in working on it. Please ask Chris (or let me know how to contact him)!
Jari
The reason you find it works OK for Norwegian is that Norwegian and German are closely related languages. Chris Biemann confirmed that for correct decompounding, the three Compact Patricia Tries (CPTs) have to be trained per language. If an already decompounded word list for Norwegian can be provided, you are lucky. If not, he suggested a rough approach: the Morfessor tool from http://www.cis.hut.fi/projects/morpho/ can automatically generate decompounded word lists out of existing word lists, such as those provided by http://corpora.informatik.uni-leipzig.de/download.html
I will see if I can provide a script in the plugin distribution ZIP that can train CPTs for languages besides German that have compounded and agglutinated forms as well (Scandinavian languages).
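Roughly, such a pipeline could look like the following sketch (all file names, the Morfessor invocation, and the trainer entry point are illustrative, not the final script):

    # words.no.txt: a plain Norwegian word list, one word per line,
    # e.g. extracted from the Leipzig corpora download

    # 1. let Morfessor segment the words into morphs
    #    (invocation is illustrative - see the Morfessor docs)
    perl morfessor1.0.pl -data words.no.txt > segmented.no.txt

    # 2. train the three CPTs from the segmented word list
    #    (hypothetical trainer class in the plugin JAR)
    java -cp elasticsearch-analysis-decompound.jar \
      decompound.TrainCPT segmented.no.txt no/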
Cheers,
Jörg
If an already decompounded word list for Norwegian can be provided, you are lucky.
Do you have an example of what this file should look like? The Norwegian spell check project at http://no.speling.org/ has a lot of relevant data. I'll have to dig in to see if they have a proper decompounded word list.
I will see if I can provide a script in the plugin distribution ZIP that can train CPTs for languages besides German that have compounded and agglutinated forms as well (Scandinavian languages).
That would be fantastic. If I find the data, would you want new language
trees included in the plugin?
Please refer to the Morfessor paper, where a decompounded word list looks like this:
Smørbrød
Midtsommernattsdrøm
...
->
Smør + brød
Midt + sommer + natt + drøm
...
Yes, I would do an update of the plugin, sure. With an ISO-639 language parameter, you could select the CPTs for the language. Besides that, there is the script I plan to develop so you could build CPTs yourself... I don't think I can handle Korean, for example.
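To sketch the idea (the "language" parameter below is only a plan, not an implemented option yet), the filter definition could then look like:

    "filter": {
      "decomp_no": {
        "type": "decompound",
        "language": "no"
      }
    }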
I just discovered your plugin while looking for something other than the native ES decompounder. The problem I have: I am building a search for products and running into the issue that some products have "herrenschuhe" and others "schuhe für herren" in the title. So my idea was to just run the filter against the titles and against the search query. But when running it against the search query, it would break "herrenschuhe" into "herrenschuhe" + "herren" + "schuhe". For this to work best, I would need the filter to drop the original "herrenschuhe". Would it be possible to add something like the preserve_original parameter in the WordDelimiter filter?
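Something like this is what I have in mind, borrowing the parameter name from the word_delimiter filter (on the decompound filter it is purely hypothetical at this point):

    "filter": {
      "decomp": {
        "type": "decompound",
        "preserve_original": false
      }
    }

With that, "herrenschuhe" would only produce "herren" + "schuhe" at query time.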
This plugin is interesting to me for the German normalization, so thanks for putting it together! Any chance you could get this uploaded to the new download.elasticsearch.org service (or Maven)? I had to manually hack the GitHub URL to get at the 1.1.0 ZIP file for this plugin; otherwise people would have to build it manually atm.
Regards,
Bruce Ritchie
You can always download the file and install it locally (-url file://....). It's no longer a clean one-step process, but better than changing the source (IMHO).
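For example (plugin name and path are illustrative):

    # download the release ZIP once, then install from the local file
    ./bin/plugin -install decompound -url file:///tmp/elasticsearch-analysis-decompound-1.1.0.zip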
Thanks for your interest - right now there is no other method than downloading with a full URL. GitHub will remove the ZIP files soon. I have no access to the Maven search site URL download or to download.elasticsearch.org.
To improve the situation, I am reorganizing all my plugins now for better distribution, more to be announced on this list. My plan is to set up a Maven, RPM and deb distribution service at the brand new bintray.com service.