Search with stemming and stopwords (German)

Hello,

I'm new to ElasticSearch and would appreciate any kind of help.

I have a type "job" with some text fields (all analyzed) and I would like
to search the "jobs" using stopwords and stemming.
In particular I have 3 requirements:

  1. Searching for "Maler" (German for "male painter") should also find
    "Malerin" (German for "female painter").
  2. Searching for "Malerin" (German for "female painter") should also find
    "Maler" (German for "male painter").
  3. Searching for "und" (German for "and") should find nothing (stopwords
    should be ignored).

How can I achieve that?
Is it possible to meet these requirements without changing the existing index
(by only manipulating the search parameters)?

Thanks a lot for any kind of help!

Kind regards
Kostiantyn Kahanskyi

--

Requirement 3) is simple. You can use the German stop filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-tokenfilter.html)
to filter out stop words. It can be applied during indexing (if you want to
reduce index size) as well as during searching (if you still want to index
"und" but don't want to be able to find it).

Requirements 1) and 2) are more complicated. Typically, stemming is done
using the snowball analyzer. Unfortunately, for German it wouldn't convert
"Malerin" and "Maler" into the same term. If you have a limited number of
terms that you would like to translate, you can try the Synonym Filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html).
You will need to apply it both during indexing and during searching.
Here is an example that might help you get started:
https://gist.github.com/b525c0f0e96139bfdfab
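
A rough sketch of the approach described above, separate from the example
this message refers to. It is a Python script that only builds the JSON body
for creating an index; it does not talk to Elasticsearch. The analyzer and
filter names (german_jobs, german_stop, job_synonyms) and the single synonym
rule are made up for illustration. "stop" and "synonym" are the token filter
types from the reference pages above, and "_german_" is assumed to select the
built-in German stop word list.

```python
import json


def build_index_settings(synonym_rules):
    """Return index settings with one analyzer that lowercases, drops
    German stop words and expands synonyms (all names are hypothetical)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Assumption: "_german_" selects the built-in German list.
                    "german_stop": {"type": "stop", "stopwords": "_german_"},
                    # Each rule is a comma-separated group of equivalent terms.
                    "job_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonym_rules),
                    },
                },
                "analyzer": {
                    "german_jobs": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "german_stop", "job_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON would be the body of the index-creation request.
    print(json.dumps(build_index_settings(["maler, malerin"]), indent=2))
```

If the same analyzer is used for the field at index time and for the query at
search time, "Maler" and "Malerin" should end up as the same terms on both
sides, and "und" should produce no term at all.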

(In reply to Kostiantyn Kahanskyi's message of Wednesday, November 14, 2012,
6:55:28 AM UTC-5.)

--

Hi Kostiantyn,

Stemming, at the Lucene analysis stage, can be algorithmic or
dictionary-based. Most preconfigured Lucene stemming, like the German
stemming, is algorithmic. The algorithm is based on the report "A Fast and
Simple Stemming Algorithm for German Words" by Jörg Caumanns.

A quick test with curl shows how it behaves:

curl 'localhost:9200/_analyze?text=und&analyzer=german&pretty'
{
"tokens" : [ ]
}

So German stop words work well: no tokens for "und" will get into the
index.

curl 'localhost:9200/_analyze?text=malerin&analyzer=german&pretty'

{
"tokens" : [ {
"token" : "malerin",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}

So you are out of luck with the standard Lucene German stemming: "malerin"
stays "malerin" while "maler" becomes "mal", so the two never end up as the
same term.
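
To make the consequence concrete, here is a small Python sketch that treats a
query as matching an indexed word only when their analyzed tokens overlap.
The ANALYZED table is hard-coded from the results reported in this message
(nothing here calls Elasticsearch), and would_match is a hypothetical helper
that simplifies real term matching.

```python
# Tokens produced for each input word, copied from the results in this thread.
ANALYZED = {
    "und": [],                # removed as a stop word
    "malerin": ["malerin"],   # left unchanged by the stemmer
    "maler": ["mal"],         # stemmed to "mal"
}


def would_match(query, indexed_word):
    """True if the query and the indexed word share at least one token."""
    return bool(set(ANALYZED[query]) & set(ANALYZED[indexed_word]))


if __name__ == "__main__":
    pairs = [("maler", "malerin"), ("malerin", "maler"), ("und", "und")]
    for query, word in pairs:
        print(query, "->", word, ":", would_match(query, word))
```

All three pairs come out as non-matching: the first two because the stems
differ, the third because the stop word leaves no token to match on.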

I know it does not help you, but here are my suggestions.

  • build a synonym filter list (a kind of dictionary) for the gendered nouns
    you have (easy but tedious; it depends on how many entries you want to
    manage and how often they change)
  • build your own stemmer that can detect the grammatical gender of nouns in
    German (hard)
  • or drop stemming and use a better linguistic method to detect word
    lemmas, e.g. OpenNLP or UIMA, and build Lucene token filters for lemma
    processing (semi-hard)
  • and build ES plugins for advanced lemmatization (easy once the Lucene
    token filters are implemented)
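
The first suggestion could be sketched in Python as follows. It assumes the
dictionary is available as groups of equivalent word forms (a hypothetical
input format; the sample words are made up) and produces one comma-separated
rule per group, the Solr-style format that the synonym token filter accepts.
Rules are lowercased on the assumption that the synonym filter runs after a
lowercase filter.

```python
def to_synonym_rules(word_groups):
    """Turn groups of equivalent words into lowercase 'a, b' rules,
    skipping groups with fewer than two distinct words."""
    rules = []
    for group in word_groups:
        # Lowercase and de-duplicate while keeping the original order.
        words = list(dict.fromkeys(w.strip().lower() for w in group if w.strip()))
        if len(words) > 1:
            rules.append(", ".join(words))
    return rules


if __name__ == "__main__":
    groups = [("Maler", "Malerin"), ("Lehrer", "Lehrerin"), ("und",)]
    # One rule per line; the output could be saved as a synonyms file.
    for rule in to_synonym_rules(groups):
        print(rule)
```

Generating the rules from a word list keeps the tedious part scriptable, so
the list can be rebuilt whenever the dictionary changes.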

If you can wait, I am working on such plugins, because I'm very keen on
German language processing with Elasticsearch, for example with OpenNLP,
UIMA, and the Stanford POS tagger. The plugins are not released yet, though;
I'm a little stuck on the token filters and the payloads. I am
curious to find out what Lucene 4 will bring to improve natural language
processing.

Best regards,

Jörg

--

Hello,

Thanks a lot for the quick response! :)

I've tried the approach with synonyms. I already had a dictionary of that
kind with 309,708 German words. Indexing takes some time, but searching is
really quick.
So far I've only tried it with about 30,000 "jobs". I will see how it performs
with 100,000...

Kind regards,
Kostiantyn Kahanskyi

(In reply to Igor Motov's message of Wednesday, 14 November 2012 17:28:06
UTC+1; his example: https://gist.github.com/b525c0f0e96139bfdfab)

--

Hello,

Yes, a plugin would be the best solution. Unfortunately, I need a solution
now :(
Would you be so kind as to notify me when your plugin is available?

Thanks a lot!

Kind regards
Kostiantyn Kahanskyi

(In reply to Jörg Prante's message of Wednesday, 14 November 2012 17:32:16
UTC+1.)

--

If anyone is interested, here is my German dictionary:
https://gist.github.com/4075236

Kind regards
Kostiantyn Kahanskyi

--

GitHub doesn't display files that big in the gist overview. Just click on
"raw" to view or download the whole file.

(In reply to Kostiantyn Kahanskyi's message of Wednesday, 14 November 2012
23:22:44 UTC+1.)

--

Yes, of course, I will announce releases here on the mailing list.

Best regards,

Jörg

--

Any news about that? :)