Differences between the Portuguese and Brazilian stemmers


(Ronan Mc Hugh) #1

What are the main differences between the Portuguese and Brazilian language analyzers in 6.x? I've run into some rather strange behaviour when implementing the Brazilian analyzer, and I'm wondering whether there is any benefit to using it rather than the Portuguese one for Brazilian text. The example below shows some odd behaviour when stemming plural forms:

The Portuguese analyzer stems animais to animal, so searches for animal retrieve animais and vice versa. The Brazilian stemmer, however, stems animais to anim but leaves animal unchanged, so these searches don't match. What is the reason for this behaviour?

Portuguese:

curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "portuguese",
  "text" : "animal animais"
}
' | jq .

{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "animal",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Brazilian:

curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "brazilian",
  "text" : "animal animais"
}
' | jq .

{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "anim",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
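To make the mismatch explicit, here is a minimal Python sketch that checks whether both input words end up as the same token. The token lists are copied from the two responses above (offsets and types omitted); the helper names `stems` and `conflates` are hypothetical and not part of any Elasticsearch client.

```python
# Token output copied from the two _analyze responses in this post
# (start_offset, end_offset and type omitted for brevity).
PORTUGUESE_RESPONSE = {
    "tokens": [
        {"token": "animal", "position": 0},
        {"token": "animal", "position": 1},
    ]
}
BRAZILIAN_RESPONSE = {
    "tokens": [
        {"token": "animal", "position": 0},
        {"token": "anim", "position": 1},
    ]
}


def stems(response):
    """Return the token strings of an _analyze response in position order."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


def conflates(response):
    """Return True if every input word was reduced to the same token."""
    return len(set(stems(response))) == 1


for name, response in [
    ("portuguese", PORTUGUESE_RESPONSE),
    ("brazilian", BRAZILIAN_RESPONSE),
]:
    print(name, stems(response), "conflated:", conflates(response))
```

A search for animal can only match a document containing animais when both words produce the same token at index and query time, which is what `conflates` tests.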

(Christoph) #2

Hi @rmchugh

This is an interesting observation. I would also have expected both stemmers to work the same way, but I don't know enough about the similarities and subtle differences between Portuguese and Brazilian Portuguese.

Elasticsearch uses the stemmers available in the Lucene project, and the two might differ slightly in either implementation or the resources they use (e.g. dictionaries):

https://lucene.apache.org/core/7_2_1/analyzers-common/org/apache/lucene/analysis/pt/PortugueseStemmer.html

https://lucene.apache.org/core/7_2_1/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html

Looking at the code, I can see that PortugueseStemmer extends RSLPStemmerBase, whereas BrazilianStemmer is a standalone class. It looks much more basic to me, but that's just a guess. For more information about the stemmer implementations and their advantages and shortcomings, I'd suggest asking on the Lucene user mailing list or even contacting the authors of the individual stemmers. Judging by the git history of both classes, not much seems to have changed in the last few years, though.
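One way to compare the stemmers in isolation from the rest of each analyzer chain (stop words and so on) is to call `_analyze` with an inline `stemmer` token filter. The following is a rough sketch, not a verified recipe: it assumes the `stemmer` filter accepts the `language` values `brazilian`, `light_portuguese` and `portuguese_rslp` (please check the stemmer token filter reference for your version) and that the node runs on localhost:9200 as in the question. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed `language` values for the `stemmer` token filter; verify them
# against the reference documentation for your Elasticsearch version.
STEMMER_LANGUAGES = ["brazilian", "light_portuguese", "portuguese_rslp"]


def build_analyze_body(language, text):
    """Build an _analyze body that applies only lowercasing and one stemmer."""
    return {
        "tokenizer": "standard",
        "filter": ["lowercase", {"type": "stemmer", "language": language}],
        "text": text,
    }


def run_analyze(body, base_url="http://localhost:9200"):
    """POST the body to _analyze and return the token strings.

    Requires a running cluster, so it is not called at import time.
    """
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


# Only print the request bodies here; pass one to run_analyze() on a live node.
for language in STEMMER_LANGUAGES:
    print(json.dumps(build_analyze_body(language, "animal animais")))
```

Running each body against the same text should show which stemmer is responsible for a given token, independent of the other filters in the built-in analyzers.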

