Working with changing sets of synonyms

Hi, in our search case we have several sets of synonyms that we want to apply in different combinations dynamically (at query time). I read this article that talks about search_analyzer, but I believe for our use case we need it to be even more flexible than that. Can I define a synonym filter and include it in a custom analyzer on the fly, at query time? Or maybe define a bunch of synonym filters at indexing time and then refer to them at query time? Is something like that possible?

Thanks

Thanks for your question, I hope you liked the article. Changing the synonym filter content (or defining various filters and referring to them "on the fly") isn't possible at query time. In most use cases I've seen that need this (e.g. in e-commerce, where you can detect upfront what an "iPhone" is and want to expand the query the user typed into the search box to various other search terms), users do this in some application-side query expansion logic.
One plugin that I know about and that tries to address parts of the query expansion problem on the user's side is Querqy by @renekrie. Maybe you can take a look and describe your use case a little bit more to see if that helps.
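
To make the "application-side query expansion" idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SYNONYM_SETS` data, the `expand_query` and `build_query` helpers, and the `name` field are hypothetical, and this is not how Querqy works internally. It only shows the general shape: pick the synonym sets per request, expand the text in the application, and send an ordinary bool/should query.

```python
from typing import Dict, List

# Hypothetical synonym sets kept at the application level (not in Elasticsearch).
SYNONYM_SETS: Dict[str, Dict[str, List[str]]] = {
    "english": {"ltd": ["limited"], "co": ["company"]},
    "french": {"cie": ["compagnie"]},
}


def expand_query(text: str, set_names: List[str]) -> List[str]:
    """Return the lowercased query plus one variant per matching synonym."""
    tokens = text.lower().split()
    variants = [" ".join(tokens)]
    for set_name in set_names:
        synonyms_for_set = SYNONYM_SETS.get(set_name, {})
        for i, token in enumerate(tokens):
            for synonym in synonyms_for_set.get(token, []):
                variant = " ".join(tokens[:i] + [synonym] + tokens[i + 1:])
                if variant not in variants:
                    variants.append(variant)
    return variants


def build_query(text: str, set_names: List[str], field: str = "name") -> dict:
    """Build a bool/should query with one match clause per expanded variant."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: variant}}
                    for variant in expand_query(text, set_names)
                ]
            }
        }
    }
```

Because the sets are chosen per call, e.g. `build_query("Acme Ltd", ["english"])`, the combination of synonym sets can change with every request without touching the index settings.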

Thanks for your answer cbuescher, I did find the article very helpful in understanding synonyms :+1:

One thing that comes to mind though: can't you predefine analyzers (rather than filters) that use different synonym filters, and then refer to them at query time? Is there a downside to having a lot of different analyzers? There will be a lot of duplication, but it can be generated at the app level. Something like:

'slovenian_company_analyzer': {
    'tokenizer': 'standard',
    'filter': [
        'lowercase',
        'asciifolding',
        'slovenian_company_synonyms',
    ]
},
'french_company_analyzer': {
    'tokenizer': 'standard',
    'filter': [
        'lowercase',
        'asciifolding',
        'french_company_synonyms',
    ]
},
'english_company_analyzer': {
    'tokenizer': 'standard',
    'filter': [
        'lowercase',
        'asciifolding',
        'english_company_synonyms',
    ]
}

And then at query time

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "name": {
                            "query": "apple",
                            "boost": 1.5,
                            "analyzer": "french_company_analyzer"
                        }
                    }
                },
                {
                    "match": {
                        "name": "pear"
                    }
                }
            ]
        }
    }
}

I understand. What you suggest certainly also works if you are willing to pay the cost of loading all the different dictionaries upfront, and if the number of different analyzers you need is known upfront and not too high. I was under the impression that you might need slightly different synonym sets with every query, which would render the approach a bit unworkable, but for a low number of variants (e.g. language-dependent) it can make sense and is certainly a good idea to try.

Great, thanks for validating my thinking :slight_smile: I wonder what the costs of loading multiple analyzers are (my example is minimal, in reality it would have to be ~100 if I do go for this approach). I guess the only way to know is to test.

That sounds quite high indeed, although I wouldn't say it doesn't work. Depending on dictionary size the cost might add up, since analyzers are also loaded per index and per node. It needs thorough testing to be sure. Are these language-dependent variants, as your example suggests (so like in a multi-tenant environment)?

Yes, roughly language-dependent. They are, in fact, countries, and some countries have multiple languages, so synonym filters do overlap quite a bit (e.g. swiss_company_synonyms might contain German, French and Italian synonyms). So far I'm thinking one synonym set per language, one filter+analyzer per country. Each country filter uses one or more synonym sets (a synonym set is just a definition we use at the app level, an array of synonym definitions that would later become ES filters).
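
For illustration, generating those per-country filters and analyzers at the app level could look roughly like the Python sketch below. The data in `SYNONYMS_BY_LANGUAGE` and `LANGUAGES_BY_COUNTRY` and the `build_analysis_settings` helper are made up for the example; only the overall shape of the `analysis` settings (a `synonym` filter with a `synonyms` list, a custom analyzer with `tokenizer` and `filter`) follows what Elasticsearch expects.

```python
from typing import Dict, List

# Hypothetical app-level data: one synonym set per language...
SYNONYMS_BY_LANGUAGE: Dict[str, List[str]] = {
    "german": ["gmbh, gesellschaft"],
    "french": ["cie, compagnie"],
    "italian": ["flli, fratelli"],
}
# ...and the languages whose synonym sets apply to each country.
LANGUAGES_BY_COUNTRY: Dict[str, List[str]] = {
    "swiss": ["german", "french", "italian"],
    "french": ["french"],
}


def build_analysis_settings() -> dict:
    """Generate one synonym filter and one analyzer per country."""
    filters: Dict[str, dict] = {}
    analyzers: Dict[str, dict] = {}
    for country, languages in LANGUAGES_BY_COUNTRY.items():
        filter_name = f"{country}_company_synonyms"
        filters[filter_name] = {
            "type": "synonym",
            # Concatenate the synonym rules of every language of this country.
            "synonyms": [
                rule
                for language in languages
                for rule in SYNONYMS_BY_LANGUAGE[language]
            ],
        }
        analyzers[f"{country}_company_analyzer"] = {
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding", filter_name],
        }
    return {"analysis": {"filter": filters, "analyzer": analyzers}}
```

The returned dict would go under `settings` when creating the index. Note that overlapping countries each get their own copy of the shared rules, which is the duplication mentioned earlier in the thread.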

Perhaps a more efficient way is to create multiple indices, one for each country, and use different 'search_analyzers' for each. Then at query time rather than thinking "which synonym set to use" we'll be deciding "which index(es) should we look up". I have no idea what the difference in performance would be though.
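
A rough Python sketch of the "which index(es) should we look up" decision, assuming a hypothetical naming scheme of one `companies_<country>` index per country, each created with its own `search_analyzer` on the `name` field. The helper names are invented for the example; the comma-separated index list in the search path is standard Elasticsearch multi-index syntax.

```python
from typing import List


def index_name(country: str) -> str:
    """Hypothetical naming scheme: one index per country."""
    return f"companies_{country}"


def build_search_request(countries: List[str], text: str) -> dict:
    """Return the path and body for a search across the chosen country indices."""
    target = ",".join(index_name(country) for country in countries)
    return {
        "path": f"/{target}/_search",
        # No "analyzer" in the query: each index applies the search_analyzer
        # configured in its own mapping for the "name" field.
        "body": {"query": {"match": {"name": text}}},
    }
```

With this layout the query itself stays the same for every country, and only the list of target indices changes per request.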

One more question on a related note, since I have you here and you wrote that article :slight_smile: (although do let me know if you'd rather it be a separate topic). In that article it is said that the synonym filter is now deprecated. Do you mean it's deprecated in Lucene (but not Elasticsearch)? How does the Elasticsearch synonym filter map to the Lucene synonym filter? If I have multi-word synonyms, should I be using the Elasticsearch synonym_graph filter rather than synonym? Would plain synonym necessarily produce buggy results with multi-word synonyms?

Thanks,
Ivan

I cannot find the exact place where you read that synonym filters are deprecated, can you point me to it? The Lucene SynonymFilter class carries a deprecation warning in its Javadoc and redirects to the new SynonymGraphFilter, but the Elasticsearch filter isn't deprecated so far although we are thinking about doing that as well, no timeline there though.

The ES synonym filter maps to Lucene's SynonymFilter, and synonym_graph uses Lucene's SynonymGraphFilter under the hood.

Yes, definitely use synonym_graph for multi-word synonyms, for the reasons stated above. If you can, you should try to avoid using it at index time though. This blog post also contains some details around graph token streams.
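
As a sketch of what "synonym_graph at search time only" can look like, the Python snippet below builds an index body where the index-time analyzer has no synonym filter and only the `search_analyzer` includes a `synonym_graph` filter. The analyzer and filter names, the `build_index_body` helper and the example synonym rule are assumptions for illustration, and the typeless `mappings` layout assumes Elasticsearch 7 or later.

```python
from typing import List


def build_index_body(synonyms: List[str]) -> dict:
    """Index body that applies synonym_graph at search time only."""
    base_filters = ["lowercase", "asciifolding"]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; the type and parameter are real.
                    "company_synonyms_graph": {
                        "type": "synonym_graph",
                        "synonyms": synonyms,
                    }
                },
                "analyzer": {
                    # Index time: no synonym expansion.
                    "company_index_analyzer": {
                        "tokenizer": "standard",
                        "filter": list(base_filters),
                    },
                    # Search time: same chain plus the graph synonym filter.
                    "company_search_analyzer": {
                        "tokenizer": "standard",
                        "filter": base_filters + ["company_synonyms_graph"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "company_index_analyzer",
                    "search_analyzer": "company_search_analyzer",
                }
            }
        },
    }
```

Calling `build_index_body(["sa, societe anonyme"])` would give a body with a multi-word rule that is only ever expanded on the query side.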

About deprecated SynonymFilter, I got my articles mixed up, it says so here under "The Solution" paragraph. It seems like it might be talking about Lucene rather than ES.

I think I can mark this as "Solved" now, thanks a lot for your replies, very helpful.
