Hi, I recently noticed some strange behavior when using the Portuguese stemmers. I'm using the light_portuguese stemmer in a production environment and have a problem stemming the word "comissões" (equivalent to "commissions" in English).
I ran the following request against the _analyze API:
GET /_analyze?pretty
{
  "text": [
    "comissão",
    "comissao",
    "comissões",
    "comissoes"
  ],
  "tokenizer": "whitespace",
  "filter": [ { "type": "stemmer", "language": "light_portuguese" } ]
}
I got the following response:
{
  "tokens" : [
    {
      "token" : "comissa",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "comissa",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 101
    },
    {
      "token" : "comissa",
      "start_offset" : 18,
      "end_offset" : 27,
      "type" : "word",
      "position" : 202
    },
    {
      "token" : "comisso",
      "start_offset" : 28,
      "end_offset" : 37,
      "type" : "word",
      "position" : 303
    }
  ]
}
I expected all four words to have the same stem, because in Portuguese "comissões" (commissions) is the plural of "comissão" (commission), but "comissoes" without the tilde (~) doesn't return the same stem as the form with the tilde.
I also tried every Portuguese stemmer (light_portuguese, minimal_portuguese, portuguese, portuguese_rslp), and with none of them do all four tokens get the same stem.
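To reproduce the comparison, a request like the one below can be repeated for each stemmer; only the "language" value changes. This is just a sketch that uses portuguese_rslp as the example, with the same words and tokenizer as in the first request.

```
GET /_analyze?pretty
{
  "text": [
    "comissão",
    "comissao",
    "comissões",
    "comissoes"
  ],
  "tokenizer": "whitespace",
  "filter": [ { "type": "stemmer", "language": "portuguese_rslp" } ]
}
```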
Does anyone have an idea whether there's something that can be done about this?