Elasticsearch Portuguese stemmer inconsistency

Hi, I recently noticed a strange behavior when using the Portuguese stemmers. I'm using the light_portuguese stemmer in a production environment and am having problems stemming the word "comissões" (equivalent to "commissions" in English).

Using the _analyze API as below:

GET /_analyze?pretty
{
  "text": [
    "comissão",
    "comissao",
    "comissões",
    "comissoes"
  ],
  "tokenizer": "whitespace",
  "filter": [ { "type": "stemmer", "language": "light_portuguese"} ]
}

I got the following response:

{
  "tokens" : [
    {
      "token" : "comissa",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "comissa",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 101
    },
    {
      "token" : "comissa",
      "start_offset" : 18,
      "end_offset" : 27,
      "type" : "word",
      "position" : 202
    },
    {
      "token" : "comisso",
      "start_offset" : 28,
      "end_offset" : 37,
      "type" : "word",
      "position" : 303
    }
  ]
}

I expected all four to produce the same stem, because in Portuguese "comissões" (commissions) is the plural of "comissão" (commission), but "comissoes" without the tilde (~) isn't returning the same stem as the form with the tilde.

Even when I try each of the Portuguese stemmers (light_portuguese, minimal_portuguese, portuguese, portuguese_rslp), the four tokens never all get the same stem.
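
For anyone who wants to reproduce the comparison, here is a rough Python sketch that builds the same _analyze request body for each of the four stemmers and checks whether a response contains a single stem. The helper names build_analyze_body and stems_agree are my own and not part of any Elasticsearch client. The script only prints the requests, so sending them (Kibana Dev Tools, curl, or an HTTP client) is left to you.

```python
import json

# The four spellings from the question and the stemmers that were tried.
WORDS = ["comissão", "comissao", "comissões", "comissoes"]
STEMMERS = ["light_portuguese", "minimal_portuguese", "portuguese", "portuguese_rslp"]


def build_analyze_body(language, words=WORDS):
    """Build an _analyze request body like the one above (hypothetical helper)."""
    return {
        "text": list(words),
        "tokenizer": "whitespace",
        "filter": [{"type": "stemmer", "language": language}],
    }


def stems_agree(response):
    """Return True if every token in an _analyze response has the same stem."""
    stems = {token["token"] for token in response.get("tokens", [])}
    return len(stems) <= 1


if __name__ == "__main__":
    # Print one request per stemmer so they can be pasted into the console.
    for language in STEMMERS:
        print("GET /_analyze?pretty")
        print(json.dumps(build_analyze_body(language), ensure_ascii=False, indent=2))
        print()
```

Passing the parsed JSON response shown earlier to stems_agree returns False, since "comisso" differs from the other three tokens.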

Does anyone have an idea whether there's something that can be done about this?
