Synonym analysis before or after Stemmer?

Hi!

Could someone clear up if synonym analysis should happen before or after a stemmer?

I'm using the WordNet format as noted in the docs:

{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "format" : "wordnet",
                        "synonyms" : [
                            "s(100000001,1,'abstain',v,1,0).",
                            "s(100000001,2,'refrain',v,1,0).",
                            "s(100000001,3,'desist',v,1,0)."
                            ...trimmed...
                        ]
                    }
                }
            }
        }
    }
}

According to the synonym analysis docs, it is more concise to stem first and then do synonym analysis:

This is an important point to consider. What if we want to combine synonyms with stemming, so that jumps, jumped, jump, leaps, leaped, and leap are all indexed as the single term jump? We could place the synonyms filter before the stemmer and list all inflections:

"jumps,jumped,leap,leaps,leaped => jump"

But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer:

"leap => jump"

This sounds great for a stem like ran => run or running => run; however, if you look at the stemming docs, you can see that words can potentially be reduced to stems like jumpi.

For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi.

Does the synonym filter generate synonym tokens for a stemmed word like jumpi? If not, which stemmer stems as much as possible and always produces a real word that the synonym filter could find synonyms for? Or should the synonym list being fed in be stemmed with the same stemmer (getting complicated)? Or, contrary to the docs, should you find synonyms first and then stem (would this produce a bunch of identical tokens, or does stemming a list of tokens merge duplicates)?

List of stemmer choices from docs:

english
The porter_stem token filter.

light_english
The kstem token filter.

minimal_english
The EnglishMinimalStemmer in Lucene, which removes plurals

lovins
The Snowball based Lovins stemmer, the first stemmer ever produced.

porter
The Snowball based Porter stemmer

porter2
The Snowball based Porter2 stemmer

possessive_english
The EnglishPossessiveFilter in Lucene which removes 's

Currently using minimal_english. Unclear whether this does something like jumping => jumpi (also unclear how to find out).

Your guidance is appreciated, thanks for the assist, you're the best! :slight_smile:

Synonym filters work against exact token text listed in the synonym file/listing.
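One way to see the exact token text a given stemmer emits is to run sample words through the _analyze API. The request body below is only a sketch: it assumes an Elasticsearch version that accepts inline filter definitions in _analyze, and the sample text is made up for illustration. It would be sent as POST /_analyze.

```json
{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        { "type": "stemmer", "language": "minimal_english" }
    ],
    "text": "jumps jumped jumping leaps"
}
```

The tokens in the response are what a synonym filter placed after this stemmer would have to match, so swapping in another stemmer name from the list shows how aggressive it is.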

Stemmers can stack, so you can apply a porter stemmer after a plural or minimal stemmer. I usually put synonyms after plural/minimal stemming but before porter or another aggressive stemmer. That way you don't have to be exhaustive with every mundane variant in the synonym list, and you also don't have to anticipate the artificial tokens generated by a more aggressive stemmer.
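As a rough sketch of that ordering (the filter and analyzer names light_stem, my_synonyms, aggressive_stem and stem_syn_stem are hypothetical, and minimal_english and porter2 are just stand-ins for "light" and "aggressive" stemmers), the settings could look something like this:

```json
{
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "light_stem": {
                        "type": "stemmer",
                        "language": "minimal_english"
                    },
                    "my_synonyms": {
                        "type": "synonym",
                        "synonyms": [
                            "leap => jump"
                        ]
                    },
                    "aggressive_stem": {
                        "type": "stemmer",
                        "language": "porter2"
                    }
                },
                "analyzer": {
                    "stem_syn_stem": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "light_stem",
                            "my_synonyms",
                            "aggressive_stem"
                        ]
                    }
                }
            }
        }
    }
}
```

The order of the filter array is the order the token filters run in, so the synonym rules only need to match what the light stemmer emits, and the aggressive stemmer then runs over both the original tokens and the synonyms.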


Whoa, that's a cool idea, Doug! Multiple stemmers at different stages, I did not think of that. Since synonym filters work against exact token text, I would think the docs on this should be amended, at least to warn against stemming too aggressively before doing synonyms.

By the way, how is the performance on double stemming?

From the docs, it looks like algorithmic stemmers are four to five times as fast as Hunspell stemmers:

Algorithmic stemmers are typically four or five times faster than Hunspell stemmers.

Speed-sensitive on this end, so I'm curious which of these are algorithmic (what does that even mean, isn't everything an algorithm?) so I can choose the most aggressive one that is also "algorithmic" (faster). They aren't marked. :thinking: Maybe by algorithmic they mean there is no attached dictionary lookup?
