Elasticsearch portuguese stemmer inconsistency

Mathulbrich · March 3, 2021, 7:52pm

Hi, I recently noticed some strange behavior when using the Portuguese stemmers. I'm using the light_portuguese stemmer in a production environment and am having problems stemming the word "comissões" (equivalent to "commissions" in English).

Using the _analyze API as below:

GET /_analyze?pretty
{
  "text": [
    "comissão",
    "comissao",
    "comissões",
    "comissoes"
  ],
  "tokenizer": "whitespace",
  "filter": [ { "type": "stemmer", "language": "light_portuguese"} ]
}

I got the following response:

{
  "tokens" : [
    {
      "token" : "comissa",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "comissa",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 101
    },
    {
      "token" : "comissa",
      "start_offset" : 18,
      "end_offset" : 27,
      "type" : "word",
      "position" : 202
    },
    {
      "token" : "comisso",
      "start_offset" : 28,
      "end_offset" : 37,
      "type" : "word",
      "position" : 303
    }
  ]
}

I expected all four to have the same stem, because in Portuguese "comissões" (commissions) is the plural of "comissão" (commission), but "comissoes" without the tilde (~) isn't returning the same stem as the version with the tilde.
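To make the mismatch easier to see, here is a minimal Python sketch that groups the input words by the stem reported in the response above. The `group_by_stem` helper is hypothetical (not part of any Elasticsearch client), the token list is copied by hand from the response, and the sketch assumes one token per input word, in order.

```python
from collections import defaultdict

# Input words, in the order they were sent to the _analyze API.
words = ["comissão", "comissao", "comissões", "comissoes"]

# Tokens copied by hand from the response above (other fields omitted).
response = {
    "tokens": [
        {"token": "comissa"},
        {"token": "comissa"},
        {"token": "comissa"},
        {"token": "comisso"},
    ]
}


def group_by_stem(words, response):
    """Map each stem to the input words that produced it.

    Hypothetical helper; assumes exactly one token per input word,
    returned in the same order as the inputs.
    """
    groups = defaultdict(list)
    for word, tok in zip(words, response["tokens"]):
        groups[tok["token"]].append(word)
    return dict(groups)


groups = group_by_stem(words, response)
for stem, members in groups.items():
    print(stem, "<-", ", ".join(members))
```

If all four words stemmed consistently, `groups` would contain a single key. With the response above it contains two, and "comissoes" is the only word in the second group.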

Even after trying each of the Portuguese stemmers (light_portuguese, minimal_portuguese, portuguese, portuguese_rslp), the four tokens never all get the same stem.

Does anyone have an idea whether there's something I can do about this?

system · March 31, 2021, 7:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
