Spanish stemming not working with words of four letters or fewer


When stemming Spanish words of four letters or fewer, the stemming algorithm ignores them completely.

For example, "roja" should produce the token "roj", because both "rojo" (masculine) and "roja" (feminine) exist in Spanish and should share a stem. I tested the algorithm used on the snowballstem demo page, and it does stem "rojo" to "roj", along with other words. But when using the stemmer inside Elasticsearch, four-letter words are completely ignored.

If I add an "s" to make the plural, for example the masculine plural form "rojos", it gets correctly stemmed to "roj".

Tried with both spanish and light_spanish stemmers.
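
To make this easier to reproduce, here is a minimal sketch that builds a request body for the Elasticsearch `_analyze` API with an inline stemmer filter. The `standard` tokenizer, the `lowercase` filter, and the helper name `analyze_body` are my assumptions for illustration, not the exact analyzer from my index. It only prints the JSON; send it to `_analyze` with whatever client you use.

```python
import json


def analyze_body(text, language):
    """Build an _analyze request body using an inline stemmer token filter.

    `language` is the stemmer filter's language setting, e.g. "spanish"
    or "light_spanish". The tokenizer and lowercase filter are assumptions.
    """
    return {
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            {"type": "stemmer", "language": language},
        ],
        "text": text,
    }


if __name__ == "__main__":
    # Singular and plural forms of the same adjective, for both stemmers.
    for language in ("spanish", "light_spanish"):
        body = analyze_body("rojo roja rojos rojas", language)
        print(json.dumps(body, ensure_ascii=False, indent=2))
```

Comparing the tokens returned for the singular and plural forms should show whether the four-letter words come back unchanged.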

Here's a capture of some words in singular and plural form. The plural forms get stemmed, while the singular four-letter words don't.

This issue is especially problematic when the same word has both a masculine and a feminine form. For example: "camisa roja" (the red shirt) and "camisa color rojo" (the red colored shirt). If I search for "la camisa roja", it should match both, because "rojo" and "roja" should both be stemmed to "roj".

This comes up especially when an object and its adjectives (or attributes) are stored, for example on an e-commerce website. Attribute values are always saved in their masculine form:
Object: "Camisa"
Attribute name: "Color"
Attribute value: "Rojo"

But the search phrase "camisa color rojo" (red colored shirt) is very uncommon. It's much more common to drop the attribute name (color) and convert the attribute value to its feminine form, giving "camisa roja" (red shirt). So "camisa roja" (red shirt) should also match "camisa color rojo" (red colored shirt).
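
To illustrate why the short words matter for matching, here is a toy sketch. The `toy_stem` function is a hypothetical stand-in that only strips a few gender/number endings; it is not the Snowball algorithm or anything Elasticsearch runs. The `skip_short` flag just mimics the behavior I'm seeing, where words of four letters or fewer come back unstemmed.

```python
def toy_stem(word, skip_short):
    """Toy stand-in for a stemmer: strips a few Spanish gender/number endings.

    When skip_short is True, words of four letters or fewer are returned
    unchanged, mimicking the behavior described in this post.
    """
    word = word.lower()
    if skip_short and len(word) <= 4:
        return word
    for suffix in ("os", "as", "o", "a"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def matches(query, document, skip_short):
    """True if every stemmed query token also appears in the stemmed document."""
    query_tokens = {toy_stem(w, skip_short) for w in query.split()}
    doc_tokens = {toy_stem(w, skip_short) for w in document.split()}
    return query_tokens <= doc_tokens


if __name__ == "__main__":
    # Both "roja" and "rojo" reduce to "roj", so the query matches.
    print(matches("camisa roja", "Camisa Color Rojo", skip_short=False))
    # "roja" and "rojo" stay as they are, so the query does not match.
    print(matches("camisa roja", "Camisa Color Rojo", skip_short=True))
```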

The issue can also be reproduced with other four-letter words that have masculine and feminine forms (though not necessarily with the same meaning), such as "caso" and "casa", "como" and "coma", etc.
