Differences between light_spanish and spanish stemmers

Daniel_Rojas · August 3, 2021, 8:10pm

We're using the light_spanish stemmer, but we've run into issues with some specific words. For example, a search for "papa" doesn't return results that contain "papas". I tested the spanish stemmer and it solves those cases, but since we run a large operation, I need to know how the two algorithms differ in practice. These specific problems are solved by the spanish stemmer, but I don't know whether other words will be stemmed incorrectly and cause similar issues in the future.

In any case, if we keep having issues with only a few words, we can add custom mappings for them with the stemmer override token filter (see "Stemmer override token filter" in the Elasticsearch Guide [7.14]).
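A minimal sketch of that override approach, assuming a hypothetical index called my-spanish-index and hypothetical filter and analyzer names (spanish_overrides, spanish_stemmer, spanish_custom). The stemmer_override filter has to come before the stemmer in the chain, because tokens it rewrites are marked as keywords and are skipped by later stemmers. For that reason each rule lists both the singular and the plural form, so both end up as the same token whichever stemmer follows:

```console
PUT /my-spanish-index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_overrides": {
          "type": "stemmer_override",
          "rules": [
            "papa, papas => papa",
            "ajo, ajos => ajo"
          ]
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "spanish_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_overrides",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
```

The analyzer would still need to be assigned to the relevant text fields in the mappings, and existing documents reindexed, before searches pick it up.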

Another question: even the spanish stemmer doesn't seem to work on words of 3 characters or fewer. For example, a search for "ajo" doesn't return results that contain "ajos". Is there a solution for this, other than adding custom mappings as described above?

Thanks.

spinscale · August 4, 2021, 9:07am

So, there are indeed two different implementations under the hood. See SpanishStemmer.java and SpanishLightStemmer.java in the apache/lucene repository on GitHub (main branch).

As you can see in the first lines of code in the latter stemmer, any token shorter than 5 characters is returned as is and not stemmed at all, so "papa" (four characters) is left unchanged. That explains the behaviour you describe above.

If you want to test and debug how those stemmers behave, you can always use the Analyze API (see "Analyze API" in the Elasticsearch Guide [7.13]).

For example:

GET _analyze
{
  "text": [
    "Cómo te llamas?"
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "spanish"
    }
  ]
}

GET _analyze
{
  "text": [
    "Cómo te llamas?"
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "light_spanish"
    }
  ]
}
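
The same request shape can be pointed at the words from the original question. This is only a sketch: it passes each word as a separate text entry and adds a lowercase filter, which is an assumption about how the real index analyzes its text. Comparing the tokens returned for each singular/plural pair shows whether both forms reduce to the same stem. Changing "language" to "spanish" and rerunning gives the other side of the comparison:

```console
GET _analyze
{
  "text": [
    "papa",
    "papas",
    "ajo",
    "ajos"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer",
      "language": "light_spanish"
    }
  ]
}
```

A search for one form only matches documents containing the other when both produce the same token here, so any pair that differs is a candidate for a stemmer override rule.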
