Differences between Portuguese and Brazilian stemmer

rmchugh · October 22, 2018, 9:39am

What are the main differences between the Portuguese and Brazilian language analyzers in 6.x? I've run into some rather strange behaviour when implementing the Brazilian analyzer, and I'm wondering whether there is any benefit to using it rather than the Portuguese one for Brazilian text. The example below shows some odd behaviour when stemming plural forms:

The Portuguese analyzer stems animais to animal, so searches for animal will retrieve animais and vice versa. The Brazilian stemmer, however, stems animais to anim but leaves animal unstemmed, so the two forms no longer produce the same token and these searches don't match. What is the reason for this behaviour?

Portuguese:

curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "portuguese",
  "text" : "animal animais"
}
' | jq .

{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "animal",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Brazilian:

curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "brazilian",
  "text" : "animal animais"
}
' | jq .

{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "anim",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

cbuescher · October 22, 2018, 10:26am

Hi @rmchugh

This is an interesting observation. I would also have expected both stemmers to work the same way, but I don't know enough about the similarities and subtle differences between Portuguese and Brazilian Portuguese.

Elasticsearch uses the stemmers available in the Lucene project, and the two might differ slightly in either their implementation or the resources they use (e.g. dictionaries):

https://lucene.apache.org/core/7_2_1/analyzers-common/org/apache/lucene/analysis/pt/PortugueseStemmer.html

https://lucene.apache.org/core/7_2_1/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html

Looking at the code, I can see that PortugueseStemmer extends RSLPStemmerBase, whereas BrazilianStemmer is a standalone class. It looks much more basic to me, but that's just a guess. For more information about the stemmer implementations and their strengths and shortcomings, I'd suggest asking on the Lucene user mailing list or even contacting the authors of the individual stemmers. Judging by the git history of both classes, there doesn't seem to have been much activity in the last few years, though.
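
One way to narrow down where the implementations diverge is to run the same text through each stemmer in isolation, using an inline stemmer token filter in the _analyze request. The following is only a rough sketch: stemmer_body and stem_with are hypothetical helper names, the node address is the one used earlier in the thread, and the language values in the usage comment (brazilian, light_portuguese, minimal_portuguese, portuguese_rslp) are the ones I believe the stemmer token filter accepts, so please check them against the docs for your version. The chain here is just standard tokenizer, lowercase, stemmer, so it skips the stop word filtering that the built-in analyzers apply.

```shell
#!/bin/sh
# Build an _analyze request body that applies a single stemmer token filter
# after the standard tokenizer and the lowercase filter.
#   $1 = stemmer language (assumed to be valid for the stemmer token filter)
#   $2 = text to analyze (inserted as-is: must not contain quotes or backslashes)
stemmer_body() {
  printf '{"tokenizer":"standard","filter":["lowercase",{"type":"stemmer","language":"%s"}],"text":"%s"}' "$1" "$2"
}

# Send the body to a local node (assumed at localhost:9200, as in the thread)
# and print the resulting tokens on one line, prefixed with the stemmer name.
stem_with() {
  stemmer_body "$1" "$2" |
    curl -s -X GET "http://localhost:9200/_analyze" \
      -H 'Content-Type: application/json' -d @- |
    jq -r --arg lang "$1" '[.tokens[].token] | "\($lang): \(join(" "))"'
}

# Example usage (requires a running node):
#   for lang in brazilian light_portuguese minimal_portuguese portuguese_rslp; do
#     stem_with "$lang" "animal animais"
#   done
```

If one of the Portuguese variants maps both animal and animais to the same token on your Brazilian text, a custom analyzer built around that stemmer may be a workable alternative to the built-in brazilian analyzer.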

system · November 19, 2018, 10:37am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
