Better French and German stemming?

Hello,

I'm using cloud.elastic.co to index metadata about German and French baroque vocal music. I need to treat inflected and uninflected forms as equivalent, so someone can search "schlummer" and find the lovely Bach aria "Schlummert ein."

I expected to have to add some baroque verb forms, but the built-in stemmers are missing even modern forms. What can I do?

Settings:

        "analyzer_full_text_de": {
          "filter": [
            "straighten_apostrophes",
            "lowercase",
            "stop_de",
            "german_normalization",
            "stemmer_de",
            "synonyms_de"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "stemmer_de": {
          "name": "german",
          "type": "stemmer"
        },
        "synonyms_de": {
          "type": "synonym_graph",
          "synonyms": [
            "helfen, hilfen"
          ]
        },
        "analyzer_full_text_fr": {
          "filter": [
            "straighten_apostrophes",
            "elision_fr",
            "lowercase",
            "stop_fr",
            "stemmer_fr",
            "remove_accents"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "stop_fr": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "elision_fr": {
          "type": "elision",
          "articles": [
            "l",
            "m",
            "t",
            "qu",
            "n",
            "s",
            "j",
            "d",
            "c",
            "jusqu",
            "quoiqu",
            "lorsqu",
            "puisqu"
          ],
          "articles_case": "true"
        },
        "stemmer_fr": {
          "name": "french",
          "type": "stemmer"
        },
        "straighten_apostrophes": {
          "pattern": "’",
          "type": "pattern_replace",
          "replacement": "'"
        }

curl -X POST "localhost:9200/.../_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "analyzer_full_text_de",
  "text":     "schlummern schlummert gegrüsst grüssen grussen"
}'

=> schlumm, schlummert, gegrusst, gruss, gruss.
I need schlummern/schlummert => schlumm and gegrüsst => gruss.

curl -X POST "localhost:9200/.../_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "analyzer_full_text_fr",
  "text":     "mal maux"
}'

=> mal, maux.
I need maux => mal.

The other stemmers for these languages didn't work any better. What else can I do?

Thanks!
Ben

To clarify the French issue: the French stemmers "light_french" and "french" both work for many -aux examples like animal=animaux. The bug is that they do not understand mal=maux, which makes me worry that they're missing other common examples.

Maybe you should open an issue in Lucene, since the French analyzer is provided by Lucene?

Otherwise, you could look at the Snowball token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html. Maybe it will work better?

Thanks for the reply, @dadoonet. I tried the Snowball token filter but got the same results, I think because the French and German stemmers use Snowball behind the scenes. I'll open an issue in Lucene.
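
In the meantime, one possible workaround is Elasticsearch's `stemmer_override` token filter, which maps listed forms to a stem of your choosing and marks them as keywords so that a later keyword-aware stemmer should leave them alone. The fragment below is only a sketch: the filter names `stemmer_override_de` and `stemmer_override_fr` are hypothetical, the rules cover just the three forms from this thread, and it assumes the filters go under the index's `analysis.filter` settings alongside the ones above.

```json
{
  "filter": {
    "stemmer_override_de": {
      "type": "stemmer_override",
      "rules": [
        "schlummert => schlumm",
        "gegrusst => gruss"
      ]
    },
    "stemmer_override_fr": {
      "type": "stemmer_override",
      "rules": [
        "maux => mal"
      ]
    }
  }
}
```

Each override would need to sit directly before its stemmer in the analyzer's filter list: `stemmer_override_de` between `german_normalization` and `stemmer_de`, and `stemmer_override_fr` between `stop_fr` and `stemmer_fr`. The German rule is written as `gegrusst` rather than `gegrüsst` because, at that point in the chain, `german_normalization` has already turned `ü` into `u`. Re-running the `_analyze` requests above is the way to confirm the result. This is a hand-maintained exception list, so it only helps for forms you know about.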

