Search with stemming and stopwords (German)

Kostiantyn_Kahanskyi · November 14, 2012, 11:55am

Hello,

I'm new to Elasticsearch and would appreciate any kind of help.

I have a type "job" with several text fields (all analyzed), and I would
like to search the jobs using stopwords and stemming.
In particular, I have three requirements:

1) Searching for "Maler" (German for "male painter") should also find
   "Malerin" (German for "female painter").
2) Searching for "Malerin" should also find "Maler".
3) Searching for "und" (German for "and") should find nothing (stopwords
   should be ignored).

How can I achieve that?
Is it possible to meet these requirements without changing the existing
index (by only changing the search parameters)?

Thanks a lot for any kind of help!

Kind regards
Kostiantyn Kahanskyi

--

Igor_Motov · November 14, 2012, 4:28pm

Requirement 3) is simple. You can use the German Stop filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-tokenfilter.html)
to filter out stop words. It can be applied during indexing (if you want to
reduce index size) as well as during searching (if you still want to index
"und" but don't want it to be findable). Requirements 1) and 2) are more
complicated. Typically, stemming is done using the snowball analyzer.
Unfortunately, for German it wouldn't convert "Malerin" and "Maler" into
the same term. If you have a limited number of terms that you would like to
translate, you can try using the Synonym Filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html).
You will need to apply it both during indexing and during searching.
Here is an example that might help you get started:

https://gist.github.com/imotov/b525c0f0e96139bfdfab

german_search.sh

curl -XDELETE localhost:9200/test 
curl -XPUT localhost:9200/test -d '{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "standard",
                        "filter": ["standard", "lowercase", "my_synonym", "my_stop"]

(The gist preview is truncated here; the full script is at the URL above.)
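Since the preview stops before the filter definitions, here is a minimal sketch of what a complete settings body along these lines might look like. It is not the content of the gist. The index name "test" and the names my_analyzer, my_synonym and my_stop come from the preview; the filter definitions, the single synonym rule, the file name german_settings.json and the ES_URL variable are assumptions for illustration. The "standard" token filter from the preview is left out because newer Elasticsearch versions no longer ship it.

```shell
#!/bin/sh
# Sketch only: the filter bodies below are assumed, not taken from the gist.
SETTINGS='{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": ["maler, malerin"]
          },
          "my_stop": {
            "type": "stop",
            "stopwords": "_german_"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "my_synonym", "my_stop"]
          }
        }
      }
    }
  }
}'

# Write the body to a file so it can be inspected before it is sent.
printf '%s\n' "$SETTINGS" > german_settings.json

# Only contact a cluster when ES_URL is set (hypothetical variable),
# for example ES_URL=localhost:9200. The Content-Type header is an
# assumption for newer versions, which require it.
if [ -n "${ES_URL:-}" ]; then
  curl -XPUT "$ES_URL/test" -H 'Content-Type: application/json' \
       -d @german_settings.json
fi
```

The synonym rule is written in lowercase because my_synonym runs after the lowercase filter in this chain. The analyzer still has to be assigned to the relevant fields in the mapping, which is not shown here.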

--

jprante · November 14, 2012, 4:32pm

Hi Kostiantyn,

Stemming, at the Lucene analysis stage, can be algorithmic or
dictionary-based. Most preconfigured Lucene stemming, like German stemming,
is algorithmic. The algorithm is based on the report "A Fast and Simple
Stemming Algorithm for German Words" by Jörg Caumanns.

But a quick test with curl reveals:

curl 'localhost:9200/_analyze?text=und&analyzer=german&pretty'
{
"tokens" : [ ]
}

So German stop words work well. No tokens for "und" will get into the
index.

curl 'localhost:9200/_analyze?text=malerin&analyzer=german&pretty'

{
"tokens" : [ {
"token" : "malerin",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}

So you are out of luck with the standard Lucene German stemming, because
"malerin" stays "malerin", while "maler" becomes "mal".

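To repeat this check for other words or analyzers, the _analyze calls above could be wrapped in a small helper. This is only a sketch: the function names and the ES_URL variable are made up for illustration, and it uses the query-string form of _analyze shown in this thread, while newer Elasticsearch versions may expect a JSON request body instead.

```shell
# extract_tokens: read a pretty-printed _analyze response on stdin and print
# one token per line. Relies on "pretty" output (one "token" field per line).
extract_tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# tokens_for ANALYZER WORD: ask the cluster at $ES_URL (assumed variable,
# e.g. localhost:9200) which tokens ANALYZER produces for WORD.
tokens_for() {
  curl -s "$ES_URL/_analyze?analyzer=$1&text=$2&pretty" | extract_tokens
}

# Offline demonstration, using a response shaped like the one quoted above.
sample='{
"tokens" : [ {
"token" : "malerin",
"start_offset" : 0,
"end_offset" : 7,
"position" : 1
} ]
}'
printf '%s\n' "$sample" | extract_tokens   # prints: malerin
```

With a running cluster, `tokens_for german maler` and `tokens_for german malerin` can then be compared side by side. The helper uses the global endpoint, so it only reaches built-in analyzers such as german.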
I know it does not help you right now, but here are my suggestions:

- Build a synonym filter list (a kind of dictionary) for the gendered
  nouns you have. This is easy but tedious; it depends on how many entries
  you want to manage and how often they change.
- Build your own stemmer that can detect the gender of German nouns (hard).
- Or drop stemming and use a better linguistic method to detect word
  lemmas, e.g. OpenNLP or UIMA, and build Lucene token filters for lemma
  processing (semi-hard).
- Then build ES plugins for advanced lemmatization (easy once the Lucene
  token filters are implemented).
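For the first suggestion, a synonym list could be bootstrapped from a list of masculine job titles. The sketch below assumes the feminine form is simply the masculine form plus "in", which does not hold for every noun (for example Arzt / Ärztin), so irregular forms would still need manual entries. The file names and the three sample nouns are hypothetical.

```shell
# Hypothetical input: masculine job titles, one per line.
cat > nouns_m.txt <<'EOF'
Maler
Lehrer
Fahrer
EOF

# Emit one comma-separated synonym rule per noun, e.g. "maler, malerin".
# Entries are lowercased on the assumption that the synonym filter runs
# after a lowercase filter in the analyzer chain.
tr '[:upper:]' '[:lower:]' < nouns_m.txt |
  sed 's/.*/&, &in/' > synonyms_de.txt

cat synonyms_de.txt
```

A file like this can be referenced from the synonym filter through its synonyms_path setting instead of listing every rule inline in the index settings.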

If you can wait, I am working on such plugins, because I'm very keen on
German language processing with Elasticsearch, for example with OpenNLP,
UIMA, and the Stanford POS tagger. But the plugins are not released yet;
I'm a little stuck on the token filters and the payloads. I am
curious to find out what Lucene 4 will bring to improve natural language
processing.

Best regards,

Jörg

--

Kostiantyn_Kahanskyi · November 14, 2012, 10:08pm

Hello,

Thanks a lot for the quick response!

I've tried the approach with synonyms. I already had a dictionary of that
kind with 309,708 German words. Indexing takes some time, but searching is
really quick.
So far I've only tried it with about 30,000 "jobs". I'll see how it
performs with 100,000...

Kind regards,
Kostiantyn Kahanskyi

--

Kostiantyn_Kahanskyi · November 14, 2012, 10:11pm

Hello,

Yes, a plugin would be the best solution. Unfortunately, I need a solution
now.
Would you be so kind as to notify me when your plugin is available?

Thanks a lot!

Kind regards
Kostiantyn Kahanskyi

--

Kostiantyn_Kahanskyi · November 14, 2012, 10:22pm

If anyone is interested, here is my German
dictionary: German synonyms for Elasticsearch · GitHub

Kind regards
Kostiantyn Kahanskyi

--

Kostiantyn_Kahanskyi · November 14, 2012, 10:24pm

GitHub doesn't display files that big in the overview; just click "raw" to
view or download the whole file.

--

jprante · November 14, 2012, 11:00pm

Yes, of course, I will announce releases here on the mailing list.

Best regards,

Jörg

--

Pictor · January 18, 2015, 6:23pm

Any news about that?
