What are the main differences between the Portuguese and Brazilian language analyzers in 6.x? I've run into some rather strange behaviour when implementing Brazilian, and I'm wondering whether there is any benefit to using the Brazilian analyzer rather than the Portuguese one for Brazilian text. The example below shows some odd behaviour when stemming plural forms.
The Portuguese analyzer stems "animais" to "animal", so searches for "animal" will retrieve "animais" and vice versa. But the Brazilian stemmer stems "animais" to "anim" and doesn't stem "animal", so the two forms produce different tokens and these searches won't match. What is the reason for this behaviour?
Portuguese:
curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "portuguese",
  "text" : "animal animais"
}
' | jq .
{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "animal",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Brazilian:
curl -s -X GET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "brazilian",
  "text" : "animal animais"
}
' | jq .
{
  "tokens": [
    {
      "token": "animal",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "anim",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
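To check more singular/plural pairs than this one, the same two requests could be scripted. The following is only a rough sketch, assuming a node on http://localhost:9200 as in the curl calls above; the helper names (analyze, stems_match, report) are made up for illustration and are not part of any client library. It sends the text to the _analyze API for each analyzer and flags the cases where the words do not end up as the same token.

```python
import json
import urllib.request

# Assumption: a local node on the same URL as in the curl examples.
BASE_URL = "http://localhost:9200"


def analyze(analyzer, text, base_url=BASE_URL):
    """Return the token strings the _analyze API emits for `text`."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts POST as well as GET
    )
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


def stems_match(tokens):
    """True when every word in the text was reduced to the same token."""
    return len(set(tokens)) == 1


def report(text, analyzers=("portuguese", "brazilian")):
    """Print the tokens per analyzer and whether they all agree."""
    for name in analyzers:
        tokens = analyze(name, text)
        status = "match" if stems_match(tokens) else "MISMATCH"
        print(name, tokens, status)


# Example call (needs a running node):
# report("animal animais")
```

With the responses shown above, stems_match would be true for the portuguese tokens (animal, animal) and false for the brazilian ones (animal, anim).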