Hi,
in my spare time this evening, while I'm still wrangling with some NLP plugins (Stanford, UIMA, OpenNLP) and eagerly awaiting Lucene 4, I reworked a Compact Patricia Trie implementation by Chris Biemann into a German word decompounding Elasticsearch analysis plugin.
It can decompound German words like "Rechtsanwaltskanzleien" into "Recht, anwalt, kanzlei" or "Jahresfeier" into "Jahr, feier". The best thing is, you don't need to provide a word list.
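To try it out, a minimal index settings sketch could look like this (assuming the token filter is registered as "decompound" - please check the README for the exact names):

    curl -XPUT 'localhost:9200/test' -d '{
      "settings": {
        "analysis": {
          "filter": {
            "decomp": { "type": "decompound" }
          },
          "analyzer": {
            "german_decompound": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "decomp"]
            }
          }
        }
      }
    }'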
Very cool!
Do you expect this to work for other languages as well? I see Norwegian mentioned in the README, which is exactly what I'm after.
I was debugging memory issues with ES today that turned out to be caused by my (probably way too large) dictionary, so if this works out it's a godsend.
Hi Jari,
just give it a shot. The CPT data provided is derived from the Leipzig Wortschatz, which is German, so I doubt it works flawlessly for Norwegian. I could try to ask Chris Biemann if he knows how to build Norwegian decompounder CPTs.
Best regards,
Jörg
Hi Jörg,
I did some simple tests which appear to work OK, but if it's possible to improve it I'd be interested in working on it. Please ask Chris (or let me know how to contact him)!
Jari
The reason you find it works OK for Norwegian is that Norwegian and German are closely related languages. Chris Biemann confirmed that for correct decompounding, the three Compact Patricia Tries (CPTs) have to be trained per language. If an already decompounded word list for Norwegian can be provided, you are lucky. If not, he suggested a rough approach: the Morfessor tool from http://www.cis.hut.fi/projects/morpho/ can automatically generate decompounded word lists out of existing word lists, such as those provided by http://corpora.informatik.uni-leipzig.de/download.html
I will see if I can provide a script in the plugin distribution ZIP that can train CPTs for languages besides German that have compounded and agglutinated forms as well (Scandinavian languages).
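Roughly, such a pipeline could look like the following sketch (all file names, the Morfessor invocation, and the trainer entry point are illustrative, not the final script):

    # words.no.txt: a plain Norwegian word list, one word per line,
    # e.g. extracted from the Leipzig corpora download

    # 1. let Morfessor segment the words into morphs
    #    (invocation is illustrative - see the Morfessor docs)
    perl morfessor1.0.pl -data words.no.txt > segmented.no.txt

    # 2. train the three CPTs from the segmented word list
    #    (hypothetical trainer class in the plugin JAR)
    java -cp elasticsearch-analysis-decompound.jar \
      decompound.TrainCPT segmented.no.txt no/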
Cheers,
Jörg
If an already decompounded word list for Norwegian can be provided, you are lucky.
Do you have an example of what this file should look like? The Norwegian spell check project at http://no.speling.org/ has a lot of relevant data. I'll have to dig in to see if they have a proper decompounded word list.
I will see if I can provide a script in the plugin distribution ZIP that can train CPTs for languages besides German that have compounded and agglutinated forms as well (Scandinavian languages).
That would be fantastic. If I find the data, would you want new language
trees included in the plugin?
Please refer to the Morfessor paper, where a decompounded word list looks like this:
Smørbrød
Midtsommernattsdrøm
...
->
Smør + brød
Midt + sommer + natt + drøm
...
Yes, I would do an update of the plugin, sure. With an ISO-639 language parameter, you could select the CPTs for the language. Besides that, there is the script I plan to develop so you could build CPTs yourself... I don't think I can handle Korean, for example.
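To sketch the idea (the "language" parameter below is only a plan, not an implemented option yet), the filter definition could then look like:

    "filter": {
      "decomp_no": {
        "type": "decompound",
        "language": "no"
      }
    }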
I just discovered your plugin while looking for something other than the native ES decompounder. The problem I have: I am building a search for products and running into the issue that some products have "herrenschuhe" and others "schuhe für herren" in the title. So my idea was to just run the filter against the titles and against the search query. But when running it against the search query, it would break "herrenschuhe" into "herrenschuhe" + "herren" + "schuhe". For this to work best, I would need the filter to drop the original "herrenschuhe". Would it be possible to add something like the preserve_original parameter in the WordDelimiter filter?
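Something like this is what I have in mind, borrowing the parameter name from the word_delimiter filter (on the decompound filter it is purely hypothetical at this point):

    "filter": {
      "decomp": {
        "type": "decompound",
        "preserve_original": false
      }
    }

With that, "herrenschuhe" would only produce "herren" + "schuhe" at query time.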
This plugin is interesting to me for the German normalization, so thanks for putting it together! Any chance you could get this uploaded to the new download.elasticsearch.org service (or Maven)? I had to manually hack the GitHub URL to get at the 1.1.0 ZIP file for this plugin; otherwise people would have to build it manually atm.
Regards,
Bruce Ritchie
You can always download the file and install it locally (-url file://....). It's no longer a clean one-step process, but better than changing the source (IMHO).
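For example (plugin name and path are illustrative):

    # download the release ZIP once, then install from the local file
    ./bin/plugin -install decompound -url file:///tmp/elasticsearch-analysis-decompound-1.1.0.zip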
Thanks for your interest - right now there is no other method than downloading with a full URL. GitHub will remove the ZIP files soon. I have no access to the Maven search site URL download or to download.elasticsearch.org.
To improve the situation, I am reorganizing all my plugins now for better distribution, more to be announced on this list. My plan is to set up a Maven, RPM and deb distribution service at the brand new bintray.com service.