On Wednesday, 14 November 2012 at 12:55:28 UTC+1, Kostiantyn Kahanskyi wrote:
Hello,
I'm new to Elasticsearch and would appreciate any kind of help.
I have a type "job" with some text fields (all analyzed), and I would like
to search the jobs using stop words and stemming.
In particular, I have 3 requirements:
1) Searching for "Maler" (German for "male painter") should also find
"Malerin" (German for "female painter").
2) Searching for "Malerin" should also find "Maler".
3) Searching for "und" (German for "and") should find nothing (stop words
should be ignored).
How can I achieve that?
Is it possible to meet these requirements without changing the existing index
(by only manipulating the search parameters)?
On Wednesday, 14 November 2012 at 17:28:06 UTC+1, Igor Motov wrote:
Requirement 3) is simple. You can use the German stop filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-tokenfilter.html)
to filter out stop words. It can be applied during indexing (if you want to
reduce index size) as well as during searching (if you still want to index
"und" but don't want it to be findable).
Requirements 1) and 2) are more complicated. Typically, stemming is done
using the snowball analyzer. Unfortunately, for German it won't convert
"Malerin" and "Maler" into the same term. If you have a limited number of
terms that you would like to map to each other, you can try the Synonym Filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html).
You will need to apply it both during indexing and during searching.
Here is an example that might help you get started: German stop words · GitHub
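As a rough illustration of the settings described above (this is not the linked example), the sketch below builds an index-settings body in Python that combines a German stop filter with a synonym filter in one custom analyzer. The filter and analyzer names "german_stop", "job_synonyms" and "job_text" are placeholders made up for this sketch, and the exact settings accepted may differ between Elasticsearch versions, so treat it as a starting point rather than a tested configuration.

```python
import json


def build_index_settings(synonym_rules):
    """Return an index-settings body with a stop filter and a synonym filter.

    The names "german_stop", "job_synonyms" and "job_text" are hypothetical
    placeholders; they do not come from the thread.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Built-in German stop word list
                    "german_stop": {"type": "stop", "stopwords": "_german_"},
                    # Each rule lists terms that should be treated as equal
                    "job_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonym_rules),
                    },
                },
                "analyzer": {
                    "job_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "german_stop", "job_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_index_settings(["maler, malerin"])
    print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON could be sent as the body when creating a new index. Because the same analyzer would then run at index time and at search time, both "Maler" and "Malerin" should end up matching the same terms.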
(In reply to Kostiantyn Kahanskyi's message of Wednesday, November 14, 2012, 6:55:28 AM UTC-5.)
On Wednesday, 14 November 2012 at 17:32:16 UTC+1, Jörg Prante wrote:
Hi Kostiantyn,
Stemming, at the Lucene analysis stage, can be algorithmic or
dictionary-based. Most of Lucene's preconfigured stemming, such as German
stemming, is algorithmic. The algorithm is based on the report "A Fast and
Simple Stemming Algorithm for German Words" by Jörg Caumanns.
But a quick test with curl reveals that you are out of luck with the standard
Lucene German stemming: "malerin" stays "malerin", while "maler" becomes "mal".
I know it does not help you, but here are my suggestions:
- Build a synonym filter list (a kind of dictionary) for the gendered nouns
  you have (easy but tedious; it depends on how many entries you want to
  manage and how often they change).
- Build your own stemmer that can detect the gender of nouns in German (hard).
- Or drop stemming and use a better linguistic method to detect word
  lemmas, e.g. OpenNLP or UIMA, and build Lucene token filters for lemma
  processing (semi-hard).
- And build ES plugins for advanced lemmatization (easy once the Lucene
  token filters are implemented).
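To make the first suggestion concrete, here is a toy simulation in plain Python of what a stop filter plus a synonym list do when the same analysis runs at index time and at search time. It does not use Lucene or Elasticsearch; the stop word set, the synonym map and the two sample jobs are invented for illustration only.

```python
import re
from collections import defaultdict

# Tiny illustrative subsets, not real Lucene word lists (assumptions)
STOPWORDS = {"und", "oder", "der", "die", "das"}
SYNONYMS = {"malerin": "maler"}  # variant -> canonical form


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, map synonyms."""
    tokens = re.findall(r"\w+", text.lower())
    return [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching any analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


jobs = {1: "Maler und Lackierer", 2: "Malerin gesucht"}
index = build_index(jobs)
print(search(index, "Maler"))    # {1, 2}
print(search(index, "Malerin"))  # {1, 2}
print(search(index, "und"))      # set()
```

Mapping every variant to one canonical term is only one possible design; a synonym filter can also expand a term into all of its variants, which trades a larger index for simpler rules.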
If you can wait, I am working on such plugins, because I'm very keen on
German language processing with Elasticsearch, for example with OpenNLP,
UIMA, and the Stanford POS tagger. But the plugins are not released yet;
I'm a little stuck on the token filters and the payloads. I am
curious to find out what Lucene 4 will bring to improve natural language
processing.
I've tried the approach with synonyms. I already had a dictionary of that
kind with 309,708 German words. The indexing takes some time, but the
searching is really quick.
I've only tried it out with about 30,000 "jobs" so far. We'll see how it
performs with 100,000...
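For readers who want to try the same thing, the sketch below shows one way a word dictionary could be turned into synonym rules. The layout of the dictionary used here is not described in this thread, so the (masculine, feminine) pair format and the sample words are assumptions. The output uses the comma-separated "a, b" rule style, where all terms on a line are treated as equivalent.

```python
def to_synonym_rules(pairs):
    """Turn (masculine, feminine) pairs into 'a, b' equivalence rules.

    The pair layout is an assumption made for this sketch; duplicates are
    dropped while the original order is kept.
    """
    rules = []
    seen = set()
    for masculine, feminine in pairs:
        rule = f"{masculine.lower()}, {feminine.lower()}"
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)
    return rules


pairs = [("Maler", "Malerin"), ("Lehrer", "Lehrerin"), ("Maler", "Malerin")]
print("\n".join(to_synonym_rules(pairs)))
```

With a few hundred thousand entries it is probably more practical to write the rules to a file and point the synonym filter at it (the `synonyms_path` setting) than to inline them in the index settings.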
Kind regards,
Kostiantyn Kahanskyi
(In reply to Igor Motov's message of Wednesday, 14 November 2012, 17:28:06 UTC+1.)
Yes, a plugin would be the best solution. Unfortunately, I need a solution
now.
Would you be so kind as to notify me when your plugin is available?
Thanks a lot!
Kind regards
Kostiantyn Kahanskyi
(In reply to Jörg Prante's message of Wednesday, 14 November 2012, 17:32:16 UTC+1.)