Hi harryf,
I haven't done any German analysis before, but I have a couple of
ideas that I think could help.
First, I would consider a synonym approach, because you want not only
the simplified ü -> u but also ü -> ue, as you said. That is done at
the TokenFilter level: the idea is to emit the token "zurich" and, at
the same position, the token "zuerich". The section "4.6 Synonyms,
aliases, and words that mean the same" in the book Lucene in Action
(Manning) is worth reading. You need to decide whether to add the
synonyms at indexing time or at search time, not both. You could do a
basic "contains" check at the token level to find umlauts, and then
add the synonyms.
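
To make the idea concrete, here is a rough sketch in plain Python (not
Lucene code). The names expand_tokens and replace_chars are made up
for illustration, and the two mapping tables only cover ä, ö and ü. It
just shows the variants being stacked at the same position as the
original token, which is what a synonym TokenFilter would do.

```python
# Hypothetical helpers for illustration only; this is not a Lucene API.
FOLDED = {"ä": "a", "ö": "o", "ü": "u"}       # "international" form
EXPANDED = {"ä": "ae", "ö": "oe", "ü": "ue"}  # German transliteration


def replace_chars(token, table):
    """Replace every character found in table, leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in token)


def expand_tokens(tokens):
    """Return (position, term) pairs; variants share the original's position."""
    result = []
    for position, token in enumerate(tokens):
        term = token.lower()
        variants = [term]
        # The basic "contains" check: only expand tokens that have an umlaut.
        if any(ch in FOLDED for ch in term):
            for table in (FOLDED, EXPANDED):
                variant = replace_chars(term, table)
                if variant not in variants:
                    variants.append(variant)
        result.extend((position, variant) for variant in variants)
    return result


if __name__ == "__main__":
    for position, term in expand_tokens(["Zürich", "Dauer"]):
        print(position, term)
```

Note that "Dauer" is left alone, because nothing is ever folded from
"ue" back to "u"; only tokens that actually contain an umlaut get
extra terms.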
Second, you could try a "sounds like" (phonetic) filter such as
Metaphone. Both representations should end up phonetically similar,
so you would not need synonyms.
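
As a toy illustration of why a phonetic key could work here, the
sketch below is NOT Metaphone. It is a much cruder, made-up
"consonant skeleton" (the function name consonant_key is my own) that
borrows one idea from such algorithms: vowels after the first letter
are ignored. I have not checked what a real Metaphone filter produces
for these words.

```python
# Toy phonetic-style key, deliberately simpler than Metaphone.
VOWELS = set("aeiouy")
FOLD = {"ä": "a", "ö": "o", "ü": "u"}


def consonant_key(word):
    """Fold umlauts, then drop every vowel except one in first position."""
    folded = "".join(FOLD.get(ch, ch) for ch in word.lower())
    kept = []
    for index, ch in enumerate(folded):
        if index > 0 and ch in VOWELS:
            continue  # vowels only count at the start of the word
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    for word in ("Zürich", "Zurich", "Zuerich"):
        print(word, "->", consonant_key(word))
```

All three spellings collapse to the same key. The price is precision:
unrelated words with the same consonants would collapse to that key
too, so a phonetic filter trades false positives for not having to
maintain synonyms.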
Third and last, I found DictionaryCompoundWordTokenFilter in Lucene,
which you could consider as well. It's not directly related to your
question, but it may be useful for German. Its Javadoc says:
"A TokenFilter that decomposes compound words found in many Germanic
languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff".
It uses a brute-force algorithm to achieve this."
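
To give a feel for what "brute-force" means there, this is a small
sketch in plain Python, not the Lucene implementation. The function
name decompose, the min_size and max_size parameters and their
default values are my own assumptions. It simply tries every substring
within the size limits against a word list.

```python
def decompose(word, dictionary, min_size=2, max_size=15):
    """Return the lowercased word followed by every dictionary subword in it."""
    lowered = word.lower()
    found = [lowered]  # keep the original compound as well
    for start in range(len(lowered)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(lowered):
                break  # longer candidates would run past the end of the word
            candidate = lowered[start:end]
            if candidate in dictionary:
                found.append(candidate)
    return found


if __name__ == "__main__":
    words = {"donau", "dampf", "schiff"}
    print(decompose("Donaudampfschiff", words))
```

Indexing the subwords next to the original compound is what lets a
search for "schiff" find "Donaudampfschiff".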
I hope that helps.
Regards,
Sebastian.
On Jan 1, 11:37 am, harryf hfue...@gmail.com wrote:
Wondering how best to handle German characters like "ü".
Given a word like "Zürich", it needs to be possible to match it with both
"Zurich" and "Zuerich". "Zurich" would be regarded as the "international"
form that, say, an English speaker would use, whereas "Zuerich" would be
seen by a German speaker as the correct alternative. Folding from "ue" to
"u" is not an option, as there can be valid words and names in German
containing "ue", e.g. "dauer".
The first problem is that there doesn't seem to be a filter that supports
the transformation from "ü" to "ue". From experimenting, both the
ASCIIFoldingFilter and the ICU folding filter only support the
transformation from "ü" to "u".
The second problem, assuming a filter existed for "ü" to "ue", is the need
to effectively store both "Zurich" and "Zuerich" given "Zürich" as the
input. Something like a multi field with different analyzers on each sub
field, I guess, but that's likely to lead to large indexes.
How best to handle this?