Is there any default synonym support for a given language that can be turned on? So that it works for that language even if I don't have any specific replacement rules defined?

Hi,
Is there any default synonym support for a given language that can be turned on, so that it works for that language even if I don't have any specific replacement rules defined?

I've learned how to use synonyms via a custom token filter, either by defining the replacement rules inline in the index settings or in a separate file under the config directory.
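For reference, this is roughly what my current setup looks like (the index name, filter name, and rules are just placeholders):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["great, awesome"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
```

Instead of the inline `synonyms` list, the filter can also point to a file via `"synonyms_path": "analysis/synonyms.txt"`, resolved relative to the config directory.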

I just wonder whether there is a default way to say "turn on synonym support for English", so that I don't have to define those rules myself; in English, 'great' is indeed a synonym of 'awesome', for example. I understand that different types of applications may need very domain-specific rules, but shouldn't there also be a general-purpose synonym setting for a given language that Elasticsearch can be configured to use?

Thanks

This is exactly the problem, and it's the reason why there isn't any "generic" synonym support available as far as I know, neither in Lucene nor in Elasticsearch. A lot of people use something like WordNet or a similar resource available on the web, though, and it kind of works most of the time. The synonym token filter has WordNet syntax support; I think you need to download the Prolog version, dig out the "wn_s.pl" file from the tarball and use that. I haven't tried it myself, to be honest, but looking at the file, the format looks about right. I don't know about the quality, though. Let me know if that works, I'd be interested in giving it a shot myself...
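If you want to try it, the filter would be configured along these lines. This is just a sketch, the index name and the location of the file under the config directory are assumptions on my part:

```json
PUT /wordnet_test
{
  "settings": {
    "analysis": {
      "filter": {
        "wordnet_synonyms": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        }
      },
      "analyzer": {
        "wordnet_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "wordnet_synonyms"]
        }
      }
    }
  }
}
```

The `"format": "wordnet"` setting tells the synonym token filter to parse the Prolog `s(...)` entries instead of the default comma-separated synonym syntax.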

Just an addition to my last post and for future reference: I got around to playing with the "wn_s.pl" file a bit now (downloadable in the Prolog files from the WordNet site). The syntax is a bit complex because the files are meant for import into their own tool chain, I suppose, but I found a bit of documentation about the Prolog file format, most prominently the s(synset_id,w_num,'word',ss_type,sense_number,tag_count) operator, which is what gets loaded by our synonym filter.
The most interesting things to dig around for in the file are probably the synset_id, which is what is later used to group similar words, and the words themselves. I found that there are a lot of very "unusual" synonyms, e.g. "fox" is not only in a synset with "dodger" and "slyboots" but is also occasionally treated as a verb and then expands to things like "trick" or even phrases like "play a joke on". That's probably too much for a quick-and-dirty general use case.
The good news is that there are some indicators of the frequency of some of these words (I think that's how to interpret the "tag_count", the last number in the entries). Filtering out all entries that have a 0 tag_count reduced the file quite a bit for me and gave more reasonable results, at least for the few examples I tried.
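In case it helps, this is roughly the kind of quick filter I mean. It's a rough sketch that assumes every relevant line follows the s(synset_id,w_num,'word',ss_type,sense_number,tag_count). shape; the file names are just examples:

```python
import re

# Keep only WordNet entries whose tag_count (the last number in each s(...) fact) is non-zero.
# Assumes each relevant line looks like: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
entry = re.compile(r"^s\(.*,(\d+)\)\.\s*$")

with open("wn_s.pl", encoding="utf-8") as src, \
        open("wn_s_filtered.pl", "w", encoding="utf-8") as dst:
    for line in src:
        match = entry.match(line)
        # Skip lines that don't match the expected shape and entries with a tag_count of 0.
        if match and int(match.group(1)) > 0:
            dst.write(line)
```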
In general, I think that when using WordNet as a starting point, one needs to apply and test some heuristics on top of the file, depending on one's own use case.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.