Differences between light_spanish and spanish stemmers

We're using the light_spanish stemmer, but we've had issues with specific words. For example, if I search for "papa", the results don't include documents that contain "papas". I tested the spanish stemmer and it solves those problems, but since we run a large operation, I need to know how the two algorithms differ in practice. These specific problems are solved by the spanish stemmer, but I don't know whether other words will be stemmed incorrectly and cause similar issues in the future.

In any case, if we keep having issues with only a few words, we can add custom mappings for those words with the stemmer override token filter: Stemmer override token filter | Elasticsearch Guide [7.14] | Elastic
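
A minimal sketch of that approach, assuming hypothetical index, analyzer and filter names (my-index, spanish_custom, spanish_overrides, spanish_stemmer) and using only the example words from this post. The override filter has to come before the stemmer in the filter chain, so that the overridden tokens are not stemmed again:

```console
# All names below (my-index, spanish_custom, spanish_overrides, spanish_stemmer) are placeholders.
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "spanish_custom": {
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_overrides", "spanish_stemmer"]
        }
      },
      "filter": {
        "spanish_overrides": {
          "type": "stemmer_override",
          "rules": [
            "papa, papas => papa",
            "ajo, ajos => ajo"
          ]
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      }
    }
  }
}
```

The rules list would have to be maintained by hand, so this only scales to a small set of problem words.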

Another question: even the spanish stemmer doesn't seem to work on words of three characters or fewer. For example, if I search for "ajo", the results don't include documents that contain "ajos". Is there a solution for this, other than adding custom mappings as described above?

Thanks.

There are indeed two different implementations under the hood. See SpanishStemmer.java and SpanishLightStemmer.java in the apache/lucene repository on GitHub (main branch).

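As you can see in the first lines of code in the latter stemmer, any token shorter than 5 characters is returned as is and not stemmed at all. That explains the behaviour you describe: "papa" (4 characters) is left untouched, while "papas" (5 characters) is stemmed, so the two no longer produce the same token.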

If you want to test and debug how those stemmers behave, you can use the Analyze API; see Analyze API | Elasticsearch Guide [7.13] | Elastic

For example:

GET _analyze
{
  "text": [
    "Cómo te llamas?"
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "spanish"
    }
  ]
}

GET _analyze
{
  "text": [
    "Cómo te llamas?"
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "light_spanish"
    }
  ]
}
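
To check the words from the question directly, the same kind of request can take several strings at once. This is only a sketch: the lowercase filter in front of the stemmer is an assumption about the rest of the analysis chain, and "light_spanish" can be swapped for "spanish" to compare the two outputs side by side:

```console
# Sample words are taken from the question; the lowercase filter is an assumed extra step.
GET _analyze
{
  "text": ["papa", "papas", "ajo", "ajos"],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer",
      "language": "light_spanish"
    }
  ]
}
```

If the singular and plural forms come back as different tokens, a search for one will not match documents that only contain the other.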
