Char_filter for German

Hi,

To handle German in search, I need to return the same results whether the
user searches for e.g. über, uber or ueber.

I would do this at index time, where I have über in the data. But if I use
just the asciifolding filter, I lose the information that this was a word
with an umlaut, and I can't get the ueber token. If I use a char_filter, it
is applied before analysis, and I would not be able to get uber.

Is it possible to preserve the original with a char filter, or to apply it
after the analysis?

Cheers,

Kresimir

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f18f94bc-58e0-4bbf-a445-b45ba4db11f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

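You may find the approach I give at the end of this talk helpful:
Approaches to multi-lingual text search with Elasticsearch and Lucene | SkillsCast | 5th March 2014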

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/


Hello Kresimir,
as a native speaker of German and a linguist, I know you usually want
to preserve the umlaut, but for searches you may want to relax the
precision of matching. So, why not do precisely this? If you have "über"
or "ueber" in your query, replace it by "über OR ueber". And if you want
to take care of those Americans who believe these two dots do not carry
any meaning at all (heavy grin at this point), you may even add "OR
uber". Orthographically, "uber" is wrong. This would only be a convenience
rule for users thinking they can simply omit umlaut dots or who are
incapable of typing umlaut characters on their keyboards.

Note: when it comes to German last names, the names Ganser, Gänser and
Gaenser would be considered three entirely different names, although the
alternative spelling (e.g., in plain e-mail addresses) of Gänser could
be Gaenser. Mapping umlauts will get you false positives.

Also be careful with the reverse. "ue", "oe" and "ae" cannot simply be
spelled as "ü", "ö" or "ä". In a word like "Zooeingang" (zoo entrance),
the composite is actually made of "Zoo" and "Eingang", so the "oe" must
not be interpreted as "ö".

Similar issues exist with "ß" and "ss".

Well, most likely these funny cases won't matter too much, so I suggest
trying a simple disjunctive expansion for a start.
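
For readers following along, here is a minimal Python sketch of the query-time disjunctive expansion described above. The function name `expand_term` and the two mapping tables are assumptions made for illustration and are not part of any Elasticsearch API. The sketch only rewrites in the forward direction (umlaut to "ue"/"u"), so it leaves words like "Zooeingang" untouched.

```python
# Umlaut -> two-letter alternative spelling (the accepted ASCII fallback).
DIGRAPHS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
# Umlaut -> bare vowel (the "dots omitted" misspelling some users type).
BASE = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def expand_term(term):
    """Return the term joined with its alternative spellings by OR."""
    term = term.lower()
    forms = [term, term.translate(DIGRAPHS), term.translate(BASE)]
    # dict.fromkeys drops duplicates while keeping the original order,
    # so a term without umlauts comes back unchanged.
    return " OR ".join(dict.fromkeys(forms))


if __name__ == "__main__":
    print(expand_term("über"))
    print(expand_term("Zooeingang"))
```

Dropping the `BASE` form from the list gives the stricter "über OR ueber" variant for applications where the bare-vowel match produces too many false positives.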

Best regards,
--Jürgen


--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071


Hi Itamar,

I don't think this solves my problem. I'm aware that you can preserve the
original with ASCII folding, but since the char_filter is applied before
ASCII folding, there would not be any umlauts left to fold :) If I could
apply the char_filter at the end, that would be OK, or preserve the original
with the char_filter.
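
The ordering problem can be reproduced outside Elasticsearch with a small Python simulation. This is only a sketch: `char_filter` and `fold_preserving_original` are hypothetical stand-ins written for this example, not the real filters, and the folding here just strips combining marks rather than implementing full ASCII folding.

```python
import unicodedata


def char_filter(text):
    """Stand-in for a mapping char filter: rewrites umlauts before tokenizing."""
    for umlaut, digraph in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(umlaut, digraph)
    return text


def fold_preserving_original(token):
    """Stand-in for folding that keeps the original token next to the folded one."""
    decomposed = unicodedata.normalize("NFD", token)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return [token] if folded == token else [token, folded]


def analyze(text, use_char_filter):
    """Run the optional char filter first, then tokenize, then fold each token."""
    text = unicodedata.normalize("NFC", text).lower()
    if use_char_filter:
        text = char_filter(text)
    tokens = []
    for token in text.split():
        tokens.extend(fold_preserving_original(token))
    return tokens


if __name__ == "__main__":
    print(analyze("über", use_char_filter=True))   # umlaut is gone before folding
    print(analyze("über", use_char_filter=False))  # folding sees the umlaut
```

With the char filter enabled, the folding step never sees an umlaut, so only one of the two alternative spellings can come out of a single chain.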

Best,

Kresimir


Hi Jürgen,

I'm aware that mapping umlauts produces many false positives, but we have
noticed that some of our users omit them while searching. I guess we'll
have to make a product decision there, because we cannot cover all use
cases anyway.

Thanks for your response!

Best,

Kresimir


What I'm saying is: don't use a char_filter, and use the token filter chain
to achieve that.


Which token filter can I use to replace words like über with ueber?


Why do you need it as ueber? What I usually do is end up with [über,
uber] at the same position, possibly marking the first as the original.
Seeing Jürgen's response, I seem to be on the right path...
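
One way to get [über, uber] at the same position is the `asciifolding` token filter with its `preserve_original` option turned on. The sketch below builds such an index settings body as a Python dict and prints it as JSON. The filter name `folding_keep_original` and the analyzer name `german_folding` are made up for this example, and the tokenizer and `lowercase` filter are assumptions about the rest of the chain. Note that this setup does not produce the ueber form.

```python
import json

# Index settings body. Only "asciifolding", "preserve_original", "custom",
# "standard" and "lowercase" are Elasticsearch names; the filter and
# analyzer names are hypothetical.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "folding_keep_original": {
                    "type": "asciifolding",
                    # Emit the unfolded token alongside the folded one.
                    "preserve_original": True,
                }
            },
            "analyzer": {
                "german_folding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "folding_keep_original"],
                }
            },
        }
    }
}

# Serialized form, ready to send as the body of an index-creation request.
body = json.dumps(index_settings, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    print(body)
```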


Because, as far as I understand, in German it's semantically the same to
write über or ueber (although ueber is used less often). I guess the only
exception is personal names.
Orthographically, "uber" is wrong, but users sometimes search for it as well.


Hi Krešimir,
the correct term is "über" (over, above) or "hören" (hear) or "ändern"
(change). When you cannot write umlauts, the correct alternative
spelling in print is "ueber", "hoeren", "aendern". Everybody can write
this in ASCII. However, non-speakers of German who still want to search
for German terms are usually not aware of this and believe it works like
accents in French, where "à" is lexically treated like "a". Those users
are wrong in spelling "uber", "horen", "andern" because "u" and "ü" are
in fact different letters. It's like "ll" in Spanish. "ll" is ONE
letter :)

However, in order to provide a convenience to those users as well, you
could decide that - to yield at least some meaningful results - you will
also consider the versions without the umlaut dots equivalent. In that
case, you want to map any token containing an umlaut (ä, ö, ü) to three
alternatives: umlaut, without umlaut marker, alternative spelling with
'e'. This won't let you distinguish between the "Bar" (bar, the place to
get a drink) and "Bär" (bear, the one giving you a great, dangerous
hug). "Forderung" (demand) and "Förderung" (encouragement, facilitation,
promotion, extraction [geol.]) are also quite different, just to give a
few examples.
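
The three-way mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not an Elasticsearch filter; the function name umlaut_variants is made up for this example, and "ß" is deliberately left out because the text above only mentions ä, ö and ü.

```python
# Minimal sketch: expand one token into the three spellings discussed above.
# umlaut_variants is a hypothetical helper, not part of any library.
UMLAUT_PLAIN = {"ä": "a", "ö": "o", "ü": "u"}
UMLAUT_E = {"ä": "ae", "ö": "oe", "ü": "ue"}


def umlaut_variants(token):
    """Return the umlaut form, the form without dots, and the 'e' form."""
    token = token.lower()
    plain = "".join(UMLAUT_PLAIN.get(c, c) for c in token)
    e_form = "".join(UMLAUT_E.get(c, c) for c in token)
    # dict.fromkeys keeps insertion order and drops duplicates,
    # so a token without umlauts yields a single entry.
    return list(dict.fromkeys([token, plain, e_form]))


print(umlaut_variants("über"))  # ['über', 'uber', 'ueber']
print(umlaut_variants("Bär"))   # ['bär', 'bar', 'baer']
```

Note that "Bär" and "Bar" now share the variant "bar", which is exactly the loss of distinction mentioned above.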

For the proper recognition of those terms, you would normally use a
dictionary of German, including some frequent proper names as well. So,
if you look for "clown boll", you would not only get "Der Clown im
Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll,
Ansichten eines Clowns", because the query would be transformed into
"clown AND (boll OR boell OR böll)" as "boll" matches an umlaut
candidate in your dictionary. If you dare to normalize your indexed
texts, so "Boell" would already have been turned into "Böll", you could
even do with a disjunction of only the one correct form and the
misspelling. Again, however, you would make use of a dictionary to
perform such normalization. Ideally, you would even have a POS tagger in
place, so you would only make such replacements where the name Böll is
referred to, not the city of Bad Boll.
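
A rough Python sketch of the dictionary-driven query rewriting described above follows. The one-entry dictionary UMLAUT_CANDIDATES and both function names are hypothetical; a real system would load a full German word list and, as noted, ideally use POS information as well.

```python
# Sketch of dictionary-based query expansion (assumed data and names).
# Maps a folded spelling to the umlaut form known to the dictionary.
UMLAUT_CANDIDATES = {"boll": "böll"}


def expand_term(term):
    """Expand a term into a disjunction if it matches an umlaut candidate."""
    term = term.lower()
    umlaut_form = UMLAUT_CANDIDATES.get(term)
    if umlaut_form is None:
        return term
    e_form = (umlaut_form.replace("ä", "ae")
                         .replace("ö", "oe")
                         .replace("ü", "ue"))
    return "(" + " OR ".join([term, e_form, umlaut_form]) + ")"


def expand_query(query):
    """Join all (possibly expanded) terms with AND."""
    return " AND ".join(expand_term(t) for t in query.split())


print(expand_query("clown boll"))  # clown AND (boll OR boell OR böll)
```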

It's a question of how much effort makes sense for your application. If
you just want to index some German text, maybe you just want to turn all
umlauts into the plain vowels for the purpose of indexing, but still
keep the reference to the original for result display. Maybe that's
sufficient. For larger volumes of documents, a more precise approach is
recommended to avoid false positives.
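
To make the simple option concrete, here is a toy in-memory sketch in Python: tokens are folded to plain vowels for matching, while the original text is kept for display. The class FoldedIndex is invented for this illustration and has nothing to do with how Elasticsearch stores data; folding "ß" to "ss" is an added assumption.

```python
# Toy illustration: fold umlauts for matching, keep the original for display.
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


class FoldedIndex:
    def __init__(self):
        self.postings = {}   # folded token -> set of document ids
        self.originals = {}  # document id -> original, unfolded text

    def add(self, doc_id, text):
        self.originals[doc_id] = text
        for token in text.lower().translate(FOLD).split():
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, term):
        # The query term is folded the same way as the indexed tokens.
        ids = self.postings.get(term.lower().translate(FOLD), set())
        return [self.originals[i] for i in sorted(ids)]


idx = FoldedIndex()
idx.add(1, "Der Bär schläft")
idx.add(2, "Die Bar öffnet")
print(idx.search("bar"))  # both documents match: the false positive
```

Searching for "bar" returns the bear and the bar alike, which is the kind of false positive that becomes more noticeable as the document volume grows.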

Cheers,
--Jürgen

On 29.11.2014 20:35, Krešimir Slugan wrote:

Because, as far as I understand, in German it's semantically the same
to write über or ueber (although ueber is less often used). I guess
this is not true only for personal names.
Syntactically, "uber" is wrong but users sometimes search for this also.

--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de

Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071

Hi Jürgen,

Currently we don't have big volumes of data to index, so we would like to
yield more results in the hope that the relevant ones will still be shown
at the top. In the future, when we have more data, we'll have to sacrifice
some use cases in order to provide more precise results for the rest of
the users.

I think I will try the regexp token approach to replace umlauts with their
"e" forms to solve this double expansion problem.

Best,

Krešimir

Do not use regex, this will give wrong results.

Elasticsearch comes with full support for German umlaut handling.

If you install the ICU plugin, you can use an analysis setting like this:

{
  "index" : {
    "analysis" : {
      "filter" : {
        "german_normalize_stem" : {
          "type" : "snowball",
          "name" : "German2"
        }
      },
      "analyzer" : {
        "stemmed" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [
            "lowercase",
            "icu_normalizer",
            "icu_folding",
            "german_normalize_stem"
          ]
        },
        "unstemmed" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [
            "lowercase",
            "icu_normalizer",
            "icu_folding",
            "german_normalize"
          ]
        }
      }
    }
  }
}

ICU handles German umlauts, and also case folding like "ss" and "ß".

Snowball handles umlaut expansions (ae, oe, ue) at the right places in
words.

You can choose between stemmed and unstemmed analysis. Snowball tends to
overstem words. The "german_normalize" token filter is copied from Snowball
but works without stemming.

The effect of the combination is that German words like Jörg, Joerg, and
Jorg are all reduced to jorg in the index.
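
What "at the right places" can mean is easier to see in code. The Python function below is a simplified sketch modeled on the rules documented for Lucene's GermanNormalizationFilter (ß becomes ss; ä, ö, ü become a, o, u; ae and oe become a and o; ue becomes u only when it does not follow a vowel or q). It is an assumption that the "german_normalize" filter above behaves the same way, and the function is not the actual Snowball or Elasticsearch code. Input is expected to be lowercased already.

```python
# Simplified sketch of rule-based German normalization (not the real filter).
# States: N = neutral, V = after a vowel or q, U = an 'e' right now would
# be an umlaut expansion and gets dropped.
N, V, U = 0, 1, 2
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}


def german_normalize(token):
    state = N
    out = []
    for c in token:
        if c in "ao":
            state = U                      # 'ae' / 'oe' always collapse
            out.append(c)
        elif c == "u":
            # 'ue' collapses only if 'u' does not follow a vowel or q
            state = U if state == N else V
            out.append(c)
        elif c == "e":
            if state != U:
                out.append(c)              # ordinary 'e', keep it
            state = V
        elif c in "iqy":
            state = V
            out.append(c)
        elif c in UMLAUTS:
            out.append(UMLAUTS[c])
            state = V
        elif c == "ß":
            out.append("ss")
            state = N
        else:
            state = N
            out.append(c)
    return "".join(out)


for word in ["jörg", "joerg", "jorg", "ueber", "quelle", "treue"]:
    print(word, "->", german_normalize(word))
```

The point of the state tracking is that "ue" in "quelle" or "treue" is left alone, whereas a plain regex replacing every "ue" with "u" would damage such words.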

Best,

Jörg

Hi,

I'm in the preliminary stages of implementing Elasticsearch, so I'm interested in this, too.

What about mixed languages, or cases where I don't even know the language? My data are emails, so the data could be in any language.

Mit freundlichen Grüßen/Regards

Trixi Willius

http://www.mothsoftware.com
Mail Archiver X: The email archiving solution for professionals

Hi,

By using my langdetect plugin together with the analyzer-by-path selection
of Elasticsearch, it is possible to analyze input according to the detected
language.

See

With "copy_to", the many analyzed fields could be merged together into a
general field (similar to _all) for convenient search.
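
The "copy_to" part could look roughly like the mapping built below. This is a sketch under several assumptions: the field names body_de, body_en and body_all are hypothetical, the language-detection step is omitted because the plugin's configuration is not shown in this thread, and "string" is the field type of Elasticsearch 1.x (newer versions use "text"). Note that "copy_to" copies the original field value, which is then analyzed by the target field's own analyzer.

```python
import json


def multilang_mapping(analyzers, target="body_all", field_type="string"):
    """Build a mapping with one field per language, all copied to `target`."""
    properties = {target: {"type": field_type}}
    for lang, analyzer in analyzers.items():
        properties["body_" + lang] = {
            "type": field_type,
            "analyzer": analyzer,   # per-language analysis
            "copy_to": target,      # merged field for convenient search
        }
    return {"properties": properties}


# "german" and "english" are built-in language analyzers.
mapping = multilang_mapping({"de": "german", "en": "english"})
print(json.dumps(mapping, indent=2))
```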

Jörg

Hello Jörg,

Could you share the configuration for the german_normalize analyzer
without stemming? I actually only need the umlaut expansion. And what do
you mean by "at the right places in words" for Snowball?

Thanks!
Andrej

Where is this "german_normalize" filter coming from? It solves my problem
completely and magically, but it's not documented anywhere (and it doesn't
seem to be part of the ICU plugin either).

What is also odd is that the filter cannot be used in the global context,
e.g. it's not possible to run something like this:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'

but it is possible to use it in an index context:

curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'

In the first case I get "ElasticsearchIllegalArgumentException[failed to find
global token filter under [german_normalize]]".

Use "german_normalization" instead.

"german_normalize" is the same filter; I implemented it in my plugin
https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
when it was not yet available in ES core.
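
With that rename, the unstemmed analyzer from earlier in the thread might be written as below. The settings are shown as a Python dict only so they can be printed as JSON; the index layout is the one posted above, with nothing but the filter name swapped. It still assumes the ICU plugin is installed for "icu_normalizer" and "icu_folding".

```python
import json

# Sketch: the earlier "unstemmed" analyzer, using the core filter name.
analysis_settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "unstemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "icu_normalizer",        # from the ICU plugin
                        "icu_folding",           # from the ICU plugin
                        "german_normalization",  # built into ES core
                    ],
                }
            }
        }
    }
}

print(json.dumps(analysis_settings, indent=2))
```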

Jörg

On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan kresimir.slugan@gmail.com
wrote:

Where is this "german_normalize" filter coming from? It solves my problem
completely and magically but it's not documented anywhere (and seems like
it's not part of ICU plugin either).

What is also weird is that filter can not be used in global context, e.g.
it's not possible to try something like this:

curl -XGET
'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize'
-d 'this is a test'

but it is possible to use it in index context:

curl -XGET
'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize'
-d 'this is a test'

In first case I get "ElasticsearchIllegalArgumentException[failed to
find global token filter under [german_normalize]]
"

On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:

Do not use regex, this will give wrong results.

Elasticsearch comes with full support for german umlaut handling.

If you install ICU plugin, you can use something like this analysis
setting

{
"index" : {
"analysis" : {
"filter" : {
"german_normalize_stem" : {
"type" : "snowball",
"name" : "German2"
}
},
"analyzer" : {
"stemmed" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [
"lowercase",
"icu_normalizer",
"icu_folding",
"german_normalize_stem"
]
},
"unstemmed" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [
"lowercase",
"icu_normalizer",
"icu_folding",
"german_normalize"
]
}
}
}
}
}

ICU handles german umlauts, and also case folding like "ss" and "ß".

Snowball handles umlaut expansions (ae, oe, ue) at the right places in
words.

You can choose between stemmed and unstemmed analysis. Snowball tends to
overstem words. The "german_normalize" token filter is copied from Snowball
but works without stemming.

The effect of the combination is that all german words like Jörg, Joerg,
Jorg are reduced to jorg in the index.

Best,

Jörg

On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan kresimi...@gmail.com
wrote:

Hi Jürgen,

Currently we don't have large volumes of data to index, so we would like to
return more results in the hope that the proper ones would still be shown at
the top. In the future, when we have more data, we'll have to sacrifice some
use cases in order to provide more precise results for the rest of the users.

I think I will try the regexp token approach to replace umlauts with their "e"
forms to solve this double expansion problem.

Best,

Krešimir

On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT)
wrote:

Hi Krešimir,
the correct term is "über" (over, above) or "hören" (hear) or
"ändern" (change). When you cannot write umlauts, the correct alternative
spelling in print is "ueber", "hoeren", "aendern". Everybody can write this
in ASCII. However, those who are possibly non-speakers of German who still
want to search for German terms are usually not aware of this and believe
it's like with accents in French, where "á" is lexically treated like "a".
Those users are wrong in spelling "uber", "horen", "andern" because "u" and
"ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE
letter :)

However, in order to provide a convenience to those users as well, you
could decide that - to yield at least some meaningful results - you will
also consider the versions without the umlaut dots equivalent. In that
case, you want to map any token containing an umlaut (ä, ö, ü) to three
alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
This won't let you distinguish between the "Bar" (bar, the place to get a
drink) and "Bär" (bear, the one giving you a great, dangerous hug).
"Forderung" (demand) and "Förderung" (encouragement, facilitation,
promotion, extraction [geol.]) are also quite different, just to give a few
examples.
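
As a small illustration of the three-way mapping described above, the following Python sketch produces the umlaut form, the form without the umlaut marker, and the alternative spelling with 'e' for a single token. The function name umlaut_variants is hypothetical, and the sketch only covers ä, ö and ü in lowercase.

```python
UMLAUT_TO_PLAIN = {"ä": "a", "ö": "o", "ü": "u"}
UMLAUT_TO_E_FORM = {"ä": "ae", "ö": "oe", "ü": "ue"}


def umlaut_variants(token):
    """Return the original token plus its plain and 'e' spellings, without duplicates."""
    token = token.lower()
    plain = "".join(UMLAUT_TO_PLAIN.get(c, c) for c in token)
    e_form = "".join(UMLAUT_TO_E_FORM.get(c, c) for c in token)
    variants = []
    for candidate in (token, plain, e_form):
        if candidate not in variants:
            variants.append(candidate)
    return variants


if __name__ == "__main__":
    print(umlaut_variants("über"))
    print(umlaut_variants("Bär"))
```

Indexing all three variants for "Bär" is exactly what makes it collide with "Bar", so this trades precision for recall.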

For the proper recognition of those terms, you would normally use a
dictionary of German, including some frequent proper names as well. So, if
you look for "clown boll", you would not only get "Der Clown im Advent -
Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines
Clowns", because the query would be transformed into "clown AND (boll OR
boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
If you dare to normalize your indexed texts, so "Boell" would already have
been turned into "Böll", you could even do with a disjunction of only the
one correct form and the misspelling. Again, however, you would make use of
a dictionary to perform such normalization. Ideally, you would even have a
POS tagger in place, so you would only make such replacements where the
name Böll is referred to, not the city of Bad Boll.
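
The dictionary-driven query rewrite described above could look roughly like this in Python. This is only a sketch: the UMLAUT_CANDIDATES dictionary and the expand_query function are invented for the example, and a real setup would load the candidates from a German dictionary rather than hard-code them.

```python
# Hypothetical dictionary: plain spelling -> known umlaut candidates.
UMLAUT_CANDIDATES = {
    "boll": ["boell", "böll"],
}


def expand_query(query, candidates=UMLAUT_CANDIDATES):
    """AND the terms together, OR-ing in the umlaut candidates the dictionary knows."""
    parts = []
    for term in query.lower().split():
        alternatives = candidates.get(term)
        if alternatives:
            parts.append("(" + " OR ".join([term] + alternatives) + ")")
        else:
            parts.append(term)
    return " AND ".join(parts)


if __name__ == "__main__":
    print(expand_query("clown boll"))
```

Terms that are not in the dictionary pass through unchanged, so only words with a known umlaut candidate are expanded.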

It's a question of how much effort makes sense for your application. If
you just want to index some German text, maybe you just want to turn all
umlauts into the plain vowels for the purpose of indexing, but still keep
the reference to the original for result display. Maybe that's sufficient.
For larger volumes of documents, a more precise approach is recommended to
avoid false positives.

Cheers,
--Jürgen

On 29.11.2014 20:35, Krešimir Slugan wrote:

Because, as far as I understand, in German it's semantically the same
to write über or ueber (although ueber is used less often). I guess the only
exception is personal names.
Orthographically, "uber" is wrong, but users sometimes search for it as well.



Thanks!

I assume that "german_normalize" is also part of the Decompounder Analysis
Plugin (jprante/elasticsearch-analysis-decompound on GitHub), since that is
the only analysis plugin we have installed?

By the way, "german_normalization" doesn't seem to be available in our ES
version (1.2). Would you recommend upgrading instead of using
"german_normalize"?

Best,

Kresimir

Yes, please upgrade Elasticsearch so you can use the official German
normalizer.

I added "german_normalize" to the decompound plugin for convenience; it may
be removed at any later time.

Jörg
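
For anyone following along after upgrading, here is a minimal sketch of the unstemmed analyzer from earlier in the thread with the built-in "german_normalization" filter swapped in for "german_normalize". It assumes an Elasticsearch version that ships "german_normalization" and that the ICU plugin is installed; the Python variable names are just for illustration, and the script only builds and prints the settings JSON.

```python
import json

# Same filter chain as the "unstemmed" analyzer above, with the core
# "german_normalization" filter replacing the plugin's "german_normalize".
settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "unstemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalization",
                    ],
                }
            }
        }
    }
}

settings_json = json.dumps(settings, indent=2)

if __name__ == "__main__":
    print(settings_json)
```

The printed JSON can be used as the settings body when creating a test index, and the analyzer can then be checked with the index-level _analyze call shown earlier.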
