Asking for suggestions on synonym design


(Xudong You) #1

We are building our own site search engine with Elasticsearch and are using its built-in synonym support to improve search relevance.
In my opinion, the key part of a synonym solution is building a comprehensive synonym dictionary.
Is there an out-of-the-box synonym dictionary we can leverage?
The solution we have considered is to use the Bing API to generate our own synonym dictionary, but that seems heavy and costly.
Any suggestions?


(Shane Connelly) #2

Synonyms tend to be not only language specific but also domain specific and sometimes regional. For example, in the context of Elasticsearch, "ES" is synonymous with "Elasticsearch," while at an atomic research lab, "Einsteinium" is probably a more likely synonym. At one university, it may be synonymous with "Environmental Science" while at a different university it may be synonymous with "Embedded Systems." In other domains it may be "Espanol." Here, "ELK" tends to be synonymous with "Elasticsearch, Logstash, Kibana" and "The Elastic Stack," whereas among the general public it may be synonymous with "moose" or "wapiti."

There are a few general-purpose synonym lists around the Internet that you can pull in (Elasticsearch supports the WordNet and Solr formats), but generalizing too much can be problematic for a number of reasons, including performance and relevancy degradation. In general, I've seen much more success with tailoring synonyms to your specific business and users. Start by looking for search tokens that produced no results and iterate from there.
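To make the domain-specific approach concrete, here is a minimal sketch of index settings with a small, hand-written synonym list in Solr format. The filter name `site_synonyms`, the analyzer name `site_search`, and the two rules are made up for illustration; only the `synonym` filter type and the rule syntax come from Elasticsearch itself. The rules are lowercase because the `lowercase` filter runs before the synonym filter.

```python
import json


def build_synonym_settings(rules):
    """Return index settings with a synonym filter fed by Solr-format rules."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # "site_synonyms" is a hypothetical filter name.
                    "site_synonyms": {"type": "synonym", "synonyms": list(rules)}
                },
                "analyzer": {
                    # "site_search" is a hypothetical analyzer name.
                    "site_search": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "site_synonyms"],
                    }
                },
            }
        }
    }


# Example rules for a site where "es" and "elk" have a domain meaning.
rules = [
    "es, elasticsearch",                       # equivalent terms
    "elk => elasticsearch, logstash, kibana",  # explicit one-way mapping
]
settings = build_synonym_settings(rules)
print(json.dumps(settings, indent=2))
```

The printed JSON is the kind of body you would send when creating the index. Keeping the list this small at first makes it easy to see which rule caused which relevance change.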


(Xudong You) #3

Thanks for your advice!
Starting from no-result tokens sounds like a good idea. Then the question becomes how to generate synonyms for those no-result tokens. Manually? Is there an automated way?


(Shane Connelly) #4

Honestly, my top advice is to stick to simple things, like doing this manually, when you're just getting started with this type of work. Manual additions will likely be quick and easy, and for the top dozen or so terms you'll know more about your data than any automated method could probably tell you. Due to Zipf's law and the principle of least effort, the net value of adding the first few synonyms tends to far outweigh the value of adding, say, the 40th or 50th. Put another way, you'll get a lot of relative value out of the first dozen or so synonyms you add, which will buy you time to think about what the next step in your relevance journey should be (which may or may not be "add more synonyms").
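One way to pick that first dozen is to count the tokens in queries that returned no hits and look at how much of the total each one covers. This is only a sketch: it assumes you can export a search log as `(query, hit_count)` pairs, and the function name and sample log below are invented for the example.

```python
from collections import Counter


def rank_zero_result_tokens(search_log, top_n=12):
    """Rank tokens from zero-hit queries by frequency.

    Returns (token, count, cumulative_share) tuples, where cumulative_share
    is the fraction of all zero-hit tokens covered so far.
    """
    counts = Counter()
    for query, hit_count in search_log:
        if hit_count == 0:
            # Naive whitespace tokenization; a real analyzer may differ.
            counts.update(query.lower().split())

    total = sum(counts.values())
    ranked = []
    covered = 0
    for token, n in counts.most_common(top_n):
        covered += n
        ranked.append((token, n, covered / total))
    return ranked


# Hypothetical log entries: (query text, number of hits returned).
search_log = [
    ("ES cluster", 0),
    ("es", 0),
    ("elk", 0),
    ("kibana", 12),
    ("es", 0),
    ("wapiti", 0),
]

for token, n, share in rank_zero_result_tokens(search_log, top_n=3):
    print(f"{token}\t{n}\t{share:.0%}")
```

If the cumulative share climbs quickly over the first few rows, that is the Zipf-like pattern described above, and a handful of manual synonyms will cover most of the failures.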

Eventually, you can use more complicated tricks, like feeding the results of the significant terms aggregation or more advanced machine-learned models into your synonyms, but I wouldn't worry about this right away.
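As a rough illustration of the significant terms idea, the sketch below builds a search body that matches documents containing a known term and asks for terms that are unusually common in that subset. The field names `body` and `tags` and the aggregation name `synonym_candidates` are assumptions, and the aggregated field is assumed to be a keyword field. The results are co-occurring terms, not confirmed synonyms, so they would still need a manual review.

```python
import json


def build_significant_terms_query(term, text_field, keyword_field, size=10):
    """Build a search body that surfaces terms over-represented in matching docs."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"match": {text_field: term}},
        "aggs": {
            # "synonym_candidates" is a hypothetical aggregation name.
            "synonym_candidates": {
                "significant_terms": {"field": keyword_field, "size": size}
            }
        },
    }


# "body" and "tags" are hypothetical field names in the index mapping.
request_body = build_significant_terms_query("elasticsearch", "body", "tags")
print(json.dumps(request_body, indent=2))
```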

