Hi!
Could someone clarify whether the synonym filter should run before or after a stemmer?
I am using the WordNet format, as noted in the docs:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms": [
              "s(100000001,1,'abstain',v,1,0).",
              "s(100000001,2,'refrain',v,1,0).",
              "s(100000001,3,'desist',v,1,0)."
              ...trimmed...
            ]
          }
        }
      }
    }
  }
}
According to the synonym analysis docs, it is more efficient to stem first and then apply synonyms. Quoting the docs:
This is an important point to consider. What if we want to combine synonyms with stemming, so that jumps, jumped, jump, leaps, leaped, and leap are all indexed as the single term jump? We could place the synonyms filter before the stemmer and list all inflections:
"jumps,jumped,leap,leaps,leaped => jump"
But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer:
"leap => jump"
This sounds great when the stem is a real word, as in ran => run or running => run; however, the stemming docs show that words can potentially be reduced to stems like jumpi. Quoting the docs again:
For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi.
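For concreteness, here is a minimal sketch of the stemmer-before-synonym ordering that the quoted docs describe. The filter and analyzer names (english_stemmer, stemmed_synonyms, stem_then_synonym) are hypothetical, and the choice of the english stemmer and the standard tokenizer is an assumption for illustration only. The order of the entries in the analyzer's "filter" array is the order in which the token filters run.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "stemmed_synonyms": {
            "type": "synonym",
            "synonyms": ["leap => jump"]
          }
        },
        "analyzer": {
          "stem_then_synonym": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stemmer", "stemmed_synonyms"]
          }
        }
      }
    }
  }
}
```

With this ordering, the synonym rules are matched against whatever the stemmer emits, so the rules would need to be written in terms of those stems, real words or not.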
Does the synonym filter generate synonym tokens for a stemmed word like jumpi? If not, which stemmer stems as aggressively as possible while always producing a real word that the synonym filter can match? Or should the synonym list itself be stemmed with the same stemmer (this is getting complicated)? Or, contrary to the docs, should synonyms be applied first and stemming second (would that leave a bunch of identical tokens, or does stemming a list of tokens merge duplicates)?
List of stemmer choices from docs:
english
The porter_stem token filter.
light_english
The kstem token filter.
minimal_english
The EnglishMinimalStemmer in Lucene, which removes plurals.
lovins
The Snowball based Lovins stemmer, the first stemmer ever produced.
porter
The Snowball based Porter stemmer.
porter2
The Snowball based Porter2 stemmer.
possessive_english
The EnglishPossessiveFilter in Lucene which removes 's.
I am currently using minimal_english. It is unclear to me whether it does something like jumping => jumpi (and also unclear how to find out).
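One possible way to find out is to send sample words through the _analyze API and look at the tokens that come back. The request body below is a sketch that assumes an Elasticsearch version whose _analyze endpoint accepts inline token filter definitions; the sample text is arbitrary. It would be sent as the body of a request to /_analyze.

```json
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "minimal_english" }
  ],
  "text": "jumps jumped jumping leaps leaped"
}
```

Swapping the "language" value for another stemmer from the list above would allow a side-by-side comparison of the stems each one produces.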
Your guidance is appreciated, thanks for the assist, you're the best!