We are a bit puzzled about some shortcomings when using hunspell for stemming and can't figure out why it fails.
Examples:
gulerod (singular), gulerødder (plural)
mand (singular), mænd (plural)
mønster (singular), mønstre (plural)
Every analyzer/filter/tokenizer combination I try fails to find a common root (stem) for the first two pairs, while the last one works.
For the bare-bones example I'm using the latest da_DK .aff and .dic files from LibreOffice.
PUT test/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "foo": {
          "tokenizer": "icu_tokenizer",
          "filter": ["folding", "lowercase", "my_hunspell"]
        }
      },
      "filter": {
        "folding": {
          "type": "icu_folding",
          "unicode_set_filter": "[^æøåÆØÅ]"
        },
        "my_hunspell": {
          "type": "hunspell",
          "locale": "da_DK"
        }
      }
    }
  }
}
POST test/_analyze
{
  "text": ["gulerod", "gulerødder", "mand", "mænd", "mønster", "mønstre"],
  "analyzer": "foo"
}
{
  "tokens" : [
    {
      "token" : "gulerod",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "gulerødder",
      "start_offset" : 8,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 101
    },
    {
      "token" : "mand",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 202
    },
    {
      "token" : "mande",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 202
    },
    {
      "token" : "mænd",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 303
    },
    {
      "token" : "mønster",
      "start_offset" : 29,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 404
    },
    {
      "token" : "mønstre",
      "start_offset" : 37,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 505
    },
    {
      "token" : "mønster",
      "start_offset" : 37,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 505
    }
  ]
}
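One way to read the response above: the hunspell filter emits every candidate stem at the same position as the word it came from, so grouping tokens by position and intersecting the groups for each singular/plural pair shows which pairs share a stem. The following is a minimal Python sketch of that check, assuming the token/position pairs are copied by hand from the response; the helper names `stems_by_position` and `pairs_share_stem` are hypothetical and not part of Elasticsearch.

```python
from collections import defaultdict

# (token, position) pairs copied by hand from the _analyze response above.
TOKENS = [
    ("gulerod", 0),
    ("gulerødder", 101),
    ("mand", 202),
    ("mande", 202),
    ("mænd", 303),
    ("mønster", 404),
    ("mønstre", 505),
    ("mønster", 505),
]


def stems_by_position(tokens):
    """Group tokens by position; each group holds the stems of one input word."""
    grouped = defaultdict(set)
    for token, position in tokens:
        grouped[position].add(token)
    return [grouped[p] for p in sorted(grouped)]


def pairs_share_stem(tokens):
    """Assumes the input words alternate singular, plural, singular, plural, ...

    Returns one boolean per pair: True if both words produced a common token.
    """
    groups = stems_by_position(tokens)
    return [bool(groups[i] & groups[i + 1]) for i in range(0, len(groups) - 1, 2)]


print(pairs_share_stem(TOKENS))  # [False, False, True]
```

Only the mønster/mønstre pair ends up with a shared token, which matches the behaviour described in the question.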
I have tried different tokenizers (standard and whitespace) and dropping all filters except my_hunspell. The outcome is always the same.
I think the issue is with hunspell or the dictionary itself, because when I use hunspell elsewhere (e.g. Node.js with nodehun) I see the same behaviour.
So my question is: is this a limitation of hunspell or of the dictionary? And is there anything I can do about it?