Better French and German stemming?

bkazez · June 9, 2020, 8:25am

Hello,

I'm using cloud.elastic.co to index metadata about German and French baroque vocal music. I need to treat inflected and uninflected forms as equivalent, so someone can search "schlummer" and find the lovely Bach aria "Schlummert ein."

I expected to have to add some baroque verb forms, but the built-in stemmers are missing even modern forms. What can I do?

Settings:

        "analyzer_full_text_de": {
          "filter": [
            "straighten_apostrophes",
            "lowercase",
            "stop_de",
            "german_normalization",
            "stemmer_de",
            "synonyms_de"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "stemmer_de": {
          "name": "german",
          "type": "stemmer"
        },
        "synonyms_de": {
          "type": "synonym_graph",
          "synonyms": [
            "helfen, hilfen"
          ]
        },
        "analyzer_full_text_fr": {
          "filter": [
            "straighten_apostrophes",
            "elision_fr",
            "lowercase",
            "stop_fr",
            "stemmer_fr",
            "remove_accents"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "stop_fr": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "elision_fr": {
          "type": "elision",
          "articles": [
            "l",
            "m",
            "t",
            "qu",
            "n",
            "s",
            "j",
            "d",
            "c",
            "jusqu",
            "quoiqu",
            "lorsqu",
            "puisqu"
          ],
          "articles_case": "true"
        },
        "stemmer_fr": {
          "name": "french",
          "type": "stemmer"
        },
        "straighten_apostrophes": {
          "pattern": "’",
          "type": "pattern_replace",
          "replacement": "'"
        }

curl -X POST "localhost:9200/.../_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "analyzer_full_text_de",
  "text":     "schlummern schlummert gegrüsst grüssen grussen"
}'

=> schlumm, schlummert, gegrusst, gruss, gruss.
I need schlummern/schlummert => schlumm and gegrüsst => gruss.

curl -X POST "localhost:9200/.../_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "analyzer_full_text_fr",
  "text":     "mal maux"
}'

=> mal, maux.
I need maux => mal.

The other stemmers for these languages didn't work better. What else can I do?

Thanks!
Ben

bkazez · June 18, 2020, 8:33am

To clarify the French issue: the French stemmers "light_french" and "french" both work for many -aux examples like animal=animaux. The bug is that they do not understand mal=maux, which makes me worry that they're missing other common examples.

dadoonet · June 18, 2020, 10:01am

Maybe you should open an issue in Lucene, since the French analyzer is provided by Lucene?

Otherwise, you could look at the Snowball token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html. Maybe it will work better?

bkazez · June 18, 2020, 2:12pm

Thanks for the reply, @dadoonet. I tried the Snowball token filter but got the same results, I think because the French and German stemmers use Snowball behind the scenes. I'll open an issue in Lucene.
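
One possible workaround for the individual forms above, until the stemmers themselves change, is Elasticsearch's built-in stemmer_override token filter, placed before the stemmer so that the listed words get a fixed stem and are skipped by the stemmer that follows. The fragment below is only a sketch in the same format as the settings in the first post. The filter names stemmer_override_de and stemmer_override_fr are made up for illustration, the rules cover just the three forms mentioned in this thread, and none of it has been tested against the index discussed here. Because german_normalization runs earlier in the chain, the German rule is written against the normalized token "gegrusst" rather than "gegrüsst".

```json
"analyzer_full_text_de": {
  "filter": [
    "straighten_apostrophes",
    "lowercase",
    "stop_de",
    "german_normalization",
    "stemmer_override_de",
    "stemmer_de",
    "synonyms_de"
  ],
  "type": "custom",
  "tokenizer": "standard"
},
"stemmer_override_de": {
  "type": "stemmer_override",
  "rules": [
    "schlummert => schlumm",
    "gegrusst => gruss"
  ]
},
"stemmer_override_fr": {
  "type": "stemmer_override",
  "rules": [
    "maux => mal"
  ]
}
```

For French, stemmer_override_fr would go directly before stemmer_fr in analyzer_full_text_fr. Running the same _analyze requests as in the first post should show whether the overrides take effect. The rule lists have to be maintained by hand, so this only helps with a known set of exceptions.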

