Hi everyone,
In several Elasticsearch projects we’ve seen stemming introduce noise early in the analysis pipeline, especially in multilingual setups.
For example:
- “organization” → “organ”
- “news” → “new”
- “united” → “unit”
These transformations can collapse unrelated terms into a single indexed form ("organ" the instrument becomes indistinguishable from "organization"), degrading matching quality and precision.
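(Outputs like these are what the classic Porter stemmer produces, e.g. Elasticsearch's `porter_stem` token filter; they're easy to reproduce with the `_analyze` API. A rough sketch with the Python client, assuming a local cluster on localhost:9200:)

```python
# Sketch: reproduce the stemming collapses above via the _analyze API.
# Assumes a local cluster and the elasticsearch-py 8.x client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.indices.analyze(
    tokenizer="standard",
    filter=["lowercase", "porter_stem"],  # classic Porter stemmer
    text="organization news united",
)
print([t["token"] for t in resp["tokens"]])
# -> ['organ', 'new', 'unit']
# Note that "organ" (the instrument) also indexes as 'organ',
# so the two terms become indistinguishable at query time.
```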
In practice, this often results in more complex query logic (n-grams, fuzzy matching, etc.) or a heavier reliance on semantic search to compensate.
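To make "more complex query logic" concrete, here's a rough sketch of the usual pattern: an exact match boosted over a fuzzy fallback. The index and field names ("articles", "title") are made up for illustration:

```python
# Sketch of the compensating query logic stemming noise tends to force:
# boost the exact match, fall back to fuzzy matching for the rest.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="articles",  # hypothetical index
    query={
        "bool": {
            "should": [
                {"match": {"title": {"query": "united", "boost": 2.0}}},
                {"match": {"title": {"query": "united", "fuzziness": "AUTO"}}},
            ]
        }
    },
)
```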
We’ve been exploring an alternative approach based on proper linguistic normalization (lemmatization + decompounding) before indexing, and testing how this impacts both lexical and semantic search performance.
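As a minimal sketch of what "normalizing upstream" looks like, here's lemmatization with spaCy applied before indexing (the model, field names, and spaCy itself are our choices here, not the only option; decompounding for languages like German would plug in at the same stage):

```python
# Sketch: lemmatize client-side before indexing, so the lemma field
# only needs lowercasing in Elasticsearch. Assumes spaCy with the
# en_core_web_sm model and a hypothetical "articles" index.
import spacy
from elasticsearch import Elasticsearch

nlp = spacy.load("en_core_web_sm")
es = Elasticsearch("http://localhost:9200")

def normalize(text: str) -> str:
    """Replace each token with its lemma."""
    return " ".join(token.lemma_ for token in nlp(text))

print(normalize("organizations in the news"))
# -> "organization in the news"  (plural reduced; "news" stays intact,
#    unlike the Porter output above)

doc = {"title": "organizations in the news"}
doc["title_lemma"] = normalize(doc["title"])
es.index(index="articles", document=doc)
```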
Shared a short write-up with examples here:
https://www.linkedin.com/pulse/how-increase-search-relevance-elasticsearch-better-text-tony-chac%C3%B3n-arkic
Curious how others here are handling this:
- sticking with stemming / custom analyzers?
- moving fully to semantic search?
- or improving normalization upstream?