Performing analysis on synonym preprocessing

We have a bunch of SQL tables which define the business's synonyms, and I'm working on migrating these synonyms to Elasticsearch.

Often these synonyms are not tokenized, and expanded forms of the same term might appear twice. For example, "apple" is a synonym of "apples", or, in a case with stopwords, "Off The Hinge" is a synonym of "OTH".

Now when I try to load these synonyms as is, Elasticsearch has a lot of problems with the entries in the dictionary. In particular, the ones that contain stopwords cause position increment errors.
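
To make the problem concrete, here is a minimal sketch of how the entries containing stopwords could be flagged before loading. It assumes the analyzer uses the default English stopword list (the post does not show the actual analyzer configuration), and the function names are hypothetical.

```python
import re

# Assumption: the default English stopword list used by the stop token filter.
# A custom analyzer may use a different list.
ENGLISH_STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def contains_stopword(term):
    """Return True if any lowercased word in the term is a stopword."""
    words = re.findall(r"\w+", term.lower())
    return any(word in ENGLISH_STOPWORDS for word in words)


def split_by_stopwords(terms):
    """Split terms into (clean, flagged), preserving input order."""
    clean, flagged = [], []
    for term in terms:
        (flagged if contains_stopword(term) else clean).append(term)
    return clean, flagged


clean, flagged = split_by_stopwords(["apple", "Off The Hinge", "OTH"])
```

This only approximates tokenization with a simple word regex, so it is a way to find suspicious entries, not a substitute for the real analysis chain.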

My first attempt at solving this was to send the terms to be analyzed by a live node via HTTP. It was way too slow, as there were around 20k different terms and it took well over a few hours.
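
For reference, a rough sketch of what one such per-term call to the `_analyze` API can look like, using only the Python standard library. The base URL, index name, and analyzer name are placeholders, not the real setup, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request for the index's _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(payload):
    """Extract the token strings, in order, from a parsed _analyze response."""
    return [entry["token"] for entry in payload.get("tokens", [])]


def analyze(base_url, index, analyzer, text):
    """Send one term to a live node and return its analyzed tokens."""
    request = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return tokens_from_response(json.load(response))


# Example call (needs a running node; all three names are placeholders):
# analyze("http://localhost:9200", "my_index", "my_analyzer", "Off The Hinge")
```

With one HTTP round trip per term, the total time grows linearly with the number of terms, which is consistent with the slowness described above.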

My second attempt was to use NLTK, but its output was inconsistent with Elasticsearch's analysis.

I wanted to know if I can replicate this analysis on the synonyms before storing them in the file. I'm happy to hear any suggestions on this topic. Thanks.
