Re: elasticsearch and swedish compound words

Hi Ragnar,

Thanks for your interest.

My plugin does not work well with Swedish. The German and Swedish
languages are close, but the plugin depends on baseform reduction trees
that are correct for the language. To create the dictionary tree files,
either a training tool must be written or the original one must be used.
If you are familiar with the GUI-based Leipzig tools at
http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/ , you could
try to find out how to build Swedish compound tree dictionary files. You
will need a list of all Swedish compound words.

Hyphenation-based analysis is similar to compound word analysis. It is
worth checking which hyphenation analysis routines exist for ES or
Lucene. There is

http://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

and it is exposed by ES via

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/compound/HyphenationCompoundWordTokenFilterFactory.java
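
As a rough, untested sketch of how that filter could be wired into index
settings: the snippet below builds a settings body in Python. The filter
type "hyphenation_decompounder" and the keys "hyphenation_patterns_path"
and "word_list" are the ones the ES factory reads; everything else (the
filter and analyzer names, the path "analysis/sv.xml" to a Swedish
hyphenation pattern file, and the two-word list) is a placeholder I made
up for illustration, so treat it as an assumption, not a recipe.

```python
import json


def build_index_settings(patterns_path, word_list):
    """Return an index settings body with a hyphenation-based decompounder.

    patterns_path: location of a hyphenation pattern XML file
                   (hypothetical here; you must supply a Swedish one).
    word_list:     subwords that are allowed to be emitted as tokens.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # "swedish_decompound" is a made-up filter name.
                    "swedish_decompound": {
                        "type": "hyphenation_decompounder",
                        "hyphenation_patterns_path": patterns_path,
                        "word_list": word_list,
                    }
                },
                "analyzer": {
                    # "swedish_compound" is a made-up analyzer name.
                    "swedish_compound": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so tokens can match the
                        # lowercase entries in the word list.
                        "filter": ["lowercase", "swedish_decompound"],
                    }
                },
            }
        }
    }


# Placeholder path and word list, for illustration only.
settings = build_index_settings("analysis/sv.xml", ["silikat", "färg"])
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The printed JSON would be sent as the body when creating the index. How
well the split works for a word like "Silikatfärg" depends entirely on
the pattern file and the word list you provide.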

Jörg

On 02.06.13 12:01, Ragnar Rova wrote:

Hello.

I saw elasticsearch-analysis-decompound on GitHub, and it mentioned
that it's specifically written for German. I was wondering how well it
would work out of the box for Swedish?

The problem I have is that a search for

"Silikat" should match "Silikatfärg"

where Silikatfärg is a Swedish compound built from "Silikat" and
"färg".

If the plugin is not a good fit, I would appreciate any advice on how
to deal with Swedish compounds (even using word lists).

Best regards,

Ragnar Rova
