Folding German characters like umlauts


(harryf) #1

Wondering how best to handle German characters like "ü".

Given a word like "Zürich", it needs to be possible to match it with both "Zurich" and "Zuerich". "Zurich" would be regarded as the "international" form that, say, an English speaker would use, whereas "Zuerich" would be seen by a German speaker as the correct alternative. Folding from "ue" to "u" is not an option, as there are valid words and names in German containing "ue", e.g. "dauer".

The first problem is that there doesn't seem to be a filter that supports the transformation from "ü" to "ue". From experimenting, both the ASCIIFoldingFilter and the ICU folding filter only support the transformation from "ü" to "u".

The second problem, assuming a filter existed for "ü" to "ue", is the need to effectively store both "Zurich" and "Zuerich" given "Zürich" as the input. Something like a multi field with a different analyzer on each sub field, I guess, but that's likely to lead to large indexes.

How best to handle this?


(Sebastian Gavarini) #2

Hi harryf,

I haven't done any German analysis before, but I have a couple of
ideas that I think could help.

First, I would consider using a synonyms approach, because you want
not only the simplified ü -> u but also ü -> ue, as you said. That is
accomplished at the TokenFilter level: the idea is to emit the token
"zurich" and, at the same position, the token "zuerich". Section 4.6
of the book Lucene in Action (Manning), "Synonyms, aliases, and words
that mean the same", is worth reading. You must decide whether to add
synonyms at indexing time or at search time, not both. You could do a
basic "contains" check at the token level to find umlauts, and then
add the synonyms.
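
To make the "same position" idea concrete, here is a minimal Python sketch.
It is not Lucene code: the function name expand_umlaut_tokens and both
mapping tables are assumptions made up for this illustration, and a real
TokenFilter would set a position increment of 0 instead of returning pairs.

```python
# Illustrative mapping tables (assumptions, not taken from any library).
SIMPLE = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
DIGRAPH = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def expand_umlaut_tokens(tokens):
    """Yield (position, term) pairs; umlaut variants share one position."""
    for position, token in enumerate(tokens):
        term = token.lower()
        variants = [term]
        for table in (SIMPLE, DIGRAPH):
            variant = term.translate(table)
            # Tokens without umlauts translate to themselves and are skipped.
            if variant not in variants:
                variants.append(variant)
        for variant in variants:
            yield position, variant


print(list(expand_umlaut_tokens(["Zürich", "See"])))
# [(0, 'zürich'), (0, 'zurich'), (0, 'zuerich'), (1, 'see')]
```

Because the variants sit at the same position, phrase queries should keep
working whichever spelling the user types.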

Second, you could try a "sounds like" (phonetic) filter, such as
Metaphone, so that both representations end up phonetically similar
and you won't need synonyms.

Third and last, I found DictionaryCompoundWordTokenFilter in Lucene,
which you could consider as well. It's not exactly related to your
question, but it may also be useful for German. Its Javadoc says:
"A TokenFilter that decomposes compound words found in many Germanic
languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff".
It uses a brute-force algorithm to achieve this."
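
For a feel of what "brute-force" means here, a rough Python sketch follows.
It only mimics the idea described in the Javadoc; the function name
decompound, its parameters, and the tiny dictionary are assumptions for this
example and not the Lucene implementation.

```python
def decompound(word, dictionary, min_len=2, max_len=15):
    """Return the word followed by every dictionary entry found inside it."""
    lowered = word.lower()
    parts = [word]  # keep the original token as well
    # Try every start offset and every candidate length (the brute-force part).
    for start in range(len(lowered) - min_len + 1):
        for length in range(min_len, max_len + 1):
            if start + length > len(lowered):
                break
            candidate = lowered[start:start + length]
            if candidate in dictionary:
                parts.append(candidate)
    return parts


words = {"donau", "dampf", "schiff"}  # hypothetical dictionary
print(decompound("Donaudampfschiff", words))
# ['Donaudampfschiff', 'donau', 'dampf', 'schiff']
```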

I hope that helps.

Regards,
Sebastian.

View this message in context:http://elasticsearch-users.115913.n3.nabble.com/Folding-German-charac...
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


#3

On Sat, Jan 1, 2011 at 9:37 AM, harryf hfuecks@gmail.com wrote:

Folding from "ue" to "u" is not
an option, as there can be valid words and names in German containing "ue"
e.g. "dauer"

You might be interested in what the Snowball "German2" stemmer
documentation says about this, as it's designed specifically to
address this issue:
http://snowball.tartarus.org/algorithms/german2/stemmer.html

"In any case the differences are little more than one word per
thousand among the native German words."

You can obviously try some synonym approach, which will likely enlarge
your index and impact performance, but do you really want to do that
for a 0.1% improvement? Obviously if you have more proper nouns and
such it might be more than 1 word per thousand, but still.

Either way, you might want to try German2 or take a look at
incorporating their heuristic, which excludes the folding for u after
q or any vowel... so your dauer example still works fine.
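
As a rough illustration of that heuristic, here is a minimal Python sketch.
It is not the Snowball implementation: fold_umlauts and the vowel set are
assumptions made for this example. It rewrites "ue" to "ü" unless the "u"
follows "q" or a vowel, then drops the umlaut, so all three spellings of
Zürich collapse to one term while "dauer" is left alone.

```python
import re

# Assumed vowel set for the example; the real stemmer defines its own.
VOWELS = "aeiouyäöü"
# Match "ue" only when it is NOT preceded by "q" or a vowel.
_UE = re.compile(r"(?<![q" + VOWELS + r"])ue")
_UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def fold_umlauts(word):
    """Map Zürich, Zuerich and Zurich to the same folded form."""
    lowered = word.lower()
    with_umlaut = _UE.sub("ü", lowered)     # "zuerich" -> "zürich"
    return with_umlaut.translate(_UMLAUTS)  # "zürich"  -> "zurich"


for w in ("Zürich", "Zuerich", "Zurich", "dauer", "Quelle"):
    print(w, "->", fold_umlauts(w))
# Zürich -> zurich
# Zuerich -> zurich
# Zurich -> zurich
# dauer -> dauer
# Quelle -> quelle
```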


(harryf) #4

Many thanks for the hint! Tried it with very positive results. Plugin is at https://github.com/harryf/elasticsearch/tree/master/plugins/analysis/snowball - needs a little more work before I can contribute back.

Minimal config to use it:

index:
  analysis:
    analyzer:
      default:
        type: custom
        tokenizer: whitespace
        filter: [snowball]
    filter:
      snowball:
        type: snowball
        language: German2

With the following script:

#!/bin/sh
curl -XPUT 'http://localhost:9200/lang/test/1' -d '
{
"text" : "Zürich"
}
'
echo ""

curl -XPUT 'http://localhost:9200/lang/test/2' -d '
{
"text" : "Zurich"
}
'
echo ""

curl -XPUT 'http://localhost:9200/lang/test/3' -d '
{
"text" : "Zuerich"
}
'
echo ""

echo "Searching for 'Zürich'"
curl -s -XGET 'http://localhost:9200/lang/test/_search?q=text:Z%C3%BCrich' | grep 'text'

echo "Searching for 'Zurich'"
curl -s -XGET 'http://localhost:9200/lang/test/_search?q=text:Zurich' | grep 'text'

echo "Searching for 'Zuerich'"
curl -s -XGET 'http://localhost:9200/lang/test/_search?q=text:Zuerich' | grep 'text'

I get this output:

Searching for 'Zürich'
"text" : "Zurich"
"text" : "Zürich"
"text" : "Zuerich"

Searching for 'Zurich'
"text" : "Zurich"
"text" : "Zürich"
"text" : "Zuerich"

Searching for 'Zuerich'
"text" : "Zurich"
"text" : "Zürich"
"text" : "Zuerich"


(ppearcy) #5

Cool, thanks for sharing back!!! Looking forward to inclusion in the
mainline, as this is very useful to get some of the more nuanced
stemming behavior.



(harryf) #6

The patch for the plugin is submitted - https://github.com/elasticsearch/elasticsearch/pull/598

In the meantime, it is also downloadable from http://goo.gl/wYoAr - extract the zip into a directory (which you need to create) named ./plugins/analysis-snowball


(Shay Banon) #7

Saw the pull request, thanks! I commented there as well, but I think it
would be great to have it in the core and not as a plugin. Better
out-of-the-box experience.



(harryf) #8

Now re-done as a core analyzer - new pull request at https://github.com/elasticsearch/elasticsearch/pull/606


(Shay Banon) #9

Cool, thanks! Applied the patch, but also pushed a small change so that it
does not fall over when configuring an analyzer that has no stopwords but
still has a stemmer.



(CBR) #10

I tried to configure ES v0.18.4 as described in pull request #606 (https://github.com/elasticsearch/elasticsearch/pull/606) and tested it with the "Zürich" test script (with the correct search term in the first case), but I don't get the expected result. Instead, the result changes every time I run the test script.

First try:

Searching for 'Zürich'
Searching for 'Zurich'
"text" : "Zurich"
Searching for 'Zuerich'
"text" : "Zuerich"

Second try:

Searching for 'Zürich'
Searching for 'Zurich'
"text" : "Zürich"
"text" : "Zurich"
Searching for 'Zuerich'
"text" : "Zuerich"

Third try:

Searching for 'Zürich'
Searching for 'Zurich'
"text" : "Zurich"
"text" : "Zuerich"
Searching for 'Zuerich'
"text" : "Zürich"
"text" : "Zurich"
"text" : "Zuerich"

Fourth try: like the first one

Do you have any ideas?


Environment: SLES 10 and Debian 6

Current configuration:

index:
  analysis:
    analyzer:
      default:
        type: custom
        tokenizer: whitespace
        filter: [snowball]
    filter:
      snowball:
        type: snowball
        language: German2


Desired configuration (but this works even less well):

index:
  analysis:
    analyzer:
      default:
        type: custom
        tokenizer: std_tokenizer
        filter: [standard, lowercase, stop_de, stem_de]
        char_filter: html_strip
    tokenizer:
      std_tokenizer:
        type: standard
    filter:
      stop_de:
        type: stop
        stopwords: [german]
      stem_de:
        type: snowball
        language: German2


(Shay Banon) #11

Gist your examples, see http://www.elasticsearch.org/help.


