Why the synonym filter change in 6.0?

Justin_Treher · May 29, 2018, 6:54pm

RE: https://github.com/elastic/elasticsearch/issues/27481

My company has been using Elasticsearch since 2013 to power site search. Our content editors have maintained a synonym configuration for ages via an admin UI that writes the file and ships it to Elasticsearch. I have built a ton of custom analyzers for our search engine that include the synonym tokenfilter.

Today, trying to upgrade to 6.2 from 5.6, we got the dreaded synonym parsing error. And I've gone down the path of trying to clean them up. The issue is that certain synonyms apply in some analyzers and are definitely removed in others. I can't just remove the synonyms, so this means I'd really need to make a synonyms file for each analyzer which seems insane.

In addition, we'll have to validate synonyms by creating a fake index every time our content team updates synonyms to verify that they aren't breaking anything. Are there any workarounds here? Ideally, if a synonym doesn't validate because it's been removed, I'd like it just to be ignored and not throw an error. Is this a case for a custom plugin or something that wraps the synonym filter?

Why was the decision made to couple the synonyms to the analyzer settings requiring independent synonyms files per analyzer?

jimczi · May 30, 2018, 9:02am

This change ensures that the filters defined before the synonym filter are also applied to the synonym rules. This is necessary: otherwise a synonym containing upper case would never match anything if the synonym filter is defined after a lowercase filter. However, there are cases where applying the filters makes the synonym invalid. This happens when a word is removed from the synonym (after a stop filter) or when words are expanded to multiple forms by another filter before the synonym filter (a phonetic filter that keeps the original form and adds the phonetic form, for instance). The synonym filter cannot match a synonym rule that has a hole or multiple forms, which is why we throw an exception when the synonym map is loaded. We don't want to silently ignore a rule that cannot match. Bottom line: a synonym filter should be applied early in the chain, so that the filters before it do not change the synonym into a form that cannot be matched.
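As a rough illustration of "early in the chain", here is a minimal sketch of index settings, written as a Python dict so it can be dumped to JSON. The names `my_synonyms`, `my_stop`, `my_analyzer` and the synonyms path are hypothetical placeholders, not taken from anyone's real configuration. The point is only the order of the `filter` list: lowercase first, then synonyms, then stop words.

```python
import json

# Hypothetical names throughout: "my_synonyms", "my_stop", "my_analyzer",
# and the synonyms file path are placeholders for illustration only.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonyms.txt",
                },
                "my_stop": {"type": "stop", "stopwords": "_swedish_"},
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Synonyms come right after lowercase and before the stop
                    # filter, so no stop word is removed from a rule before
                    # the synonym filter parses it.
                    "filter": ["lowercase", "my_synonyms", "my_stop"],
                }
            },
        }
    }
}

print(json.dumps(settings, indent=2))
```

With this ordering the synonym rules are only lowercased when they are parsed, so a multi-word rule containing a stop word keeps all of its words.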
Do you have an example of an analyzer that fails to load your synonyms in 6.x?

Kalle12345 · May 30, 2018, 10:18am

Hi !

When I use the synonym token filter before the stopwords token filter it does not match anything, and when I put it after the stopwords token filter it throws an exception.

jimczi · May 30, 2018, 10:47am

What is the exception when you set a stop filter after the synonym filter? It should throw an exception if you set it before but not after.

Kalle12345 · May 30, 2018, 11:00am

Hi !

I think you misunderstood. I said I do get an exception when I put the synonym token filter after the stop words token filter.

("swedish_stopwords_tokenfilter", "synonym_graph_tokenfilter")

The exception:
Request failed to execute. ServerError: Type: illegal_argument_exception Reason: "failed to build synonyms" CausedBy: "Type: parse_exception Reason: "Invalid synonym rule at line 128" CausedBy: "Type: illegal_argument_exception Reason: "term: svenska för invandrare analyzed to a token (invandrare) with position increment != 1 (got: 2)"""
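To make the "position increment != 1 (got: 2)" part concrete, here is a toy Python sketch. It is not Lucene or Elasticsearch code, and it assumes that "för" is in the stop word list, which the error message suggests. The helper names `analyze` and `check_rule` are made up for this illustration.

```python
def analyze(text, stopwords):
    """Toy analyzer: whitespace tokenizer + lowercase + stop filter.

    Returns (token, position_increment) pairs. A removed stop word leaves a
    gap, so the next surviving token gets an increment greater than 1.
    """
    tokens = []
    increment = 1
    for word in text.lower().split():
        if word in stopwords:
            increment += 1  # the removed word still occupies a position
            continue
        tokens.append((word, increment))
        increment = 1
    return tokens


def check_rule(term, stopwords):
    """Reject a synonym term whose analyzed form contains a gap."""
    for token, increment in analyze(term, stopwords):
        if increment != 1:
            raise ValueError(
                f"term: {term} analyzed to a token ({token}) "
                f"with position increment != 1 (got: {increment})"
            )


# Assumption: "för" is treated as a stop word.
print(analyze("svenska för invandrare", {"för"}))
```

In this toy model "invandrare" ends up two positions after "svenska", which is the kind of hole the synonym parser refuses to accept.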

I think my problem is related to:

Justin_Treher · May 30, 2018, 11:03am

@jimczi I totally get why the synonyms are processed the way they are and I think it's a powerful change. My concern is that it throws when creating an index and there is no way around it. I'd rather it just optionally ignore that error and continue.

Here is an example:

Take the synonym &,and

The ampersand will be eliminated by the standard tokenizer used in custom analyzer A, which will then cause the index creation to throw. However, that synonym is useful in a different custom analyzer, B, which uses the whitespace tokenizer where the & is preserved.

I just checked and we have over 40 analyzers and 2,300 synonyms. I can't just go in and whack away these synonyms. I can't even tell you how many combinations fail.

As a workaround, do you think it's a matter of just creating a custom synonym plugin that extends the existing synonym tokenfilter and catches the exception and ignores it?
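Roughly the behavior I'm after, as a toy Python sketch rather than actual plugin code. The two tokenizers are crude stand-ins for the standard and whitespace tokenizers, all function names are made up, and it only handles comma-separated rules and only detects terms that analyze to nothing:

```python
import re


def standard_like(text):
    """Crude stand-in for the standard tokenizer: keeps word characters only."""
    return re.findall(r"\w+", text.lower())


def whitespace(text):
    """Crude stand-in for the whitespace tokenizer: splits on whitespace."""
    return text.lower().split()


def usable_rules(rules, tokenize):
    """Split rules into (kept, skipped) for one analyzer instead of raising.

    A rule is skipped when any of its terms produces no tokens at all.
    """
    kept, skipped = [], []
    for rule in rules:
        terms = [term.strip() for term in rule.split(",")]
        if all(tokenize(term) for term in terms):
            kept.append(rule)
        else:
            skipped.append(rule)
    return kept, skipped


rules = ["&,and", "tv,television"]
print(usable_rules(rules, standard_like))  # "&,and" is skipped for analyzer A
print(usable_rules(rules, whitespace))     # both rules are kept for analyzer B
```

That way one shared synonyms file could serve every analyzer, and each analyzer would simply drop the rules it cannot use.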

jimczi · May 30, 2018, 2:14pm

Yes, you can create a custom synonym filter and rewrite the SolrSynonymParser. However, I get your point about being able to ignore rules that throw an exception. Can you open a feature request on GitHub so that it can be discussed more widely?

Justin_Treher · May 30, 2018, 3:22pm

@jimczi Here you go https://github.com/elastic/elasticsearch/issues/30968

Thanks for your help.

