Hunspell fails to stem correctly

We are puzzled by some shortcomings of the hunspell token filter when stemming Danish words, and we can't figure out why it fails.

Examples:

gulerod (singular), gulerødder (plural)
mand (singular), mænd (plural)
mønster (singular), mønstre (plural)

Every analyzer/filter/tokenizer combination I have tried fails to find a common root (stem) for the first two pairs, while the last pair works.

For the bare-bones example below, I'm using the latest da_DK .aff and .dic files from LibreOffice.

PUT test/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "foo": {
          "tokenizer": "icu_tokenizer",
          "filter": ["folding", "lowercase", "my_hunspell"]
        }
      },
      "filter": {
        "folding": {
          "type": "icu_folding",
          "unicode_set_filter": "[^æøåÆØÅ]"
        },
        "my_hunspell": {
          "type": "hunspell",
          "locale": "da_DK"
        }
      }
    }
  }
}


POST test/_analyze
{
  "text": ["gulerod", "gulerødder", "mand", "mænd", "mønster", "mønstre"],
  "analyzer": "foo"
}

Response:

{
  "tokens" : [
    {
      "token" : "gulerod",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "gulerødder",
      "start_offset" : 8,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 101
    },
    {
      "token" : "mand",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 202
    },
    {
      "token" : "mande",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 202
    },
    {
      "token" : "mænd",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 303
    },
    {
      "token" : "mønster",
      "start_offset" : 29,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 404
    },
    {
      "token" : "mønstre",
      "start_offset" : 37,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 505
    },
    {
      "token" : "mønster",
      "start_offset" : 37,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 505
    }
  ]
}
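
To make the failure easier to see, here is a minimal Python sketch (not part of the Elasticsearch setup) that groups the tokens from the _analyze response by position and checks whether each singular/plural pair has a token in common. The helper names `stems_by_position` and `shared_stems` are made up for illustration, and the (position, token) list is copied by hand from the response.

```python
from collections import defaultdict

# (position, token) pairs copied by hand from the _analyze response
tokens = [
    (0, "gulerod"),
    (101, "gulerødder"),
    (202, "mand"),
    (202, "mande"),
    (303, "mænd"),
    (404, "mønster"),
    (505, "mønstre"),
    (505, "mønster"),
]


def stems_by_position(pairs):
    """Collect the set of tokens emitted at each position."""
    grouped = defaultdict(set)
    for position, token in pairs:
        grouped[position].add(token)
    return dict(grouped)


def shared_stems(grouped, pos_a, pos_b):
    """Return the tokens that two positions have in common."""
    return grouped[pos_a] & grouped[pos_b]


grouped = stems_by_position(tokens)

# Positions of each singular/plural pair in the analyzed text
word_pairs = {
    "gulerod/gulerødder": (0, 101),
    "mand/mænd": (202, 303),
    "mønster/mønstre": (404, 505),
}

results = {
    label: shared_stems(grouped, a, b) for label, (a, b) in word_pairs.items()
}

for label, common in results.items():
    print(label, "->", sorted(common) if common else "no common token")
```

Only mønster/mønstre end up with a token in common (mønster); the other two pairs share nothing, so a search for the singular would not match the plural.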

I have also tried other tokenizers (standard and whitespace) and dropped every filter except my_hunspell. The outcome is the same in every case.

I think the issue lies with hunspell itself or with the dictionary, because I see the same behavior when I use hunspell elsewhere (e.g. in Node.js with nodehun).

So my question is: is this a limitation of hunspell or of the dictionary? And is there anything I can do about it?
