My company has been using Elasticsearch since 2013 to power site search. Our content editors have maintained a synonym configuration for ages via an admin UI that writes the file and ships it to Elasticsearch. I have built a ton of custom analyzers for our search engine that include the synonym tokenfilter.
Today, while trying to upgrade from 5.6 to 6.2, we hit the dreaded synonym parsing error, and I've gone down the path of trying to clean the synonyms up. The issue is that certain synonyms apply in some analyzers but are stripped out in others. I can't simply remove the synonyms, so I'd really need a separate synonyms file for each analyzer, which seems insane.
In addition, we'd have to validate synonyms by creating a throwaway index every time our content team updates them, just to verify they aren't breaking anything. Is there any workaround here? Ideally, if a synonym doesn't validate because a term has been removed, I'd like it to just be ignored rather than throw an error. Is this a case for a custom plugin, or something that wraps the synonym filter?
Why was the decision made to couple the synonyms to the analyzer settings, effectively requiring an independent synonyms file per analyzer?
This change ensures that the filters defined before the synonym filter are applied to the synonym rules themselves. That is mandatory: otherwise a synonym containing upper case would never match anything if the synonym filter is defined after a lowercase filter. However, there are cases where applying those filters makes a synonym invalid. This happens when a word is removed from the synonym (by a stop filter), or when words are expanded to multiple forms by a filter that runs before the synonym filter (for instance, a phonetic filter that keeps the original form and adds the phonetic form). The synonym filter cannot match a rule that has a hole or a different form, which is why we throw an exception when the synonym map is loaded; we don't want to silently ignore a rule that can never match. The bottom line is that a synonym filter should be applied early in the chain, to ensure that the filters before it do not change the synonym's form into something that cannot be matched.
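As a hedged illustration of the first point (this index body is my own, not from the thread): in 6.x, the rule text is itself run through the filters that precede the synonym filter, so a rule written as `USA, United States of America` is registered in lowercased form and can still match the lowercased token stream:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["USA, United States of America"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
```

Without that pre-analysis of the rule text, `USA` would be looked up verbatim against tokens that the `lowercase` filter has already turned into `usa`, and the rule would never fire.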
Do you have an example of an analyzer that fails to load your synonym in 6.x ?
When I put the synonym token filter before the stopwords token filter it doesn't match anything, and when I put it after the stopwords token filter it throws an exception.
The exception:
Request failed to execute. ServerError: Type: illegal_argument_exception Reason: "failed to build synonyms" CausedBy: "Type: parse_exception Reason: "Invalid synonym rule at line 128" CausedBy: "Type: illegal_argument_exception Reason: "term: svenska för invandrare analyzed to a token (invandrare) with position increment != 1 (got: 2)"""
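For what it's worth, settings along these lines (my reconstruction, not the poster's actual config) should reproduce that error. The `_swedish_` stopword list contains "för", so the stop filter removes it from the rule text and "invandrare" arrives with a position increment of 2, i.e. a hole:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_synonyms": {
          "type": "synonym",
          "synonyms": ["svenska för invandrare, sfi"]
        }
      },
      "analyzer": {
        "swedish_search": {
          "tokenizer": "standard",
          "filter": ["lowercase", "swedish_stop", "swedish_synonyms"]
        }
      }
    }
  }
}
```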
@jimczi I totally get why the synonyms are processed the way they are and I think it's a powerful change. My concern is that it throws when creating an index and there is no way around it. I'd rather it just optionally ignore that error and continue.
Here is an example:
Take the synonym &,and
The ampersand is eliminated by the standard tokenizer used in custom analyzer A, which causes index creation to throw. However, that same synonym is useful in a different custom analyzer, B, which uses the whitespace tokenizer, where the & is preserved.
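Concretely, the situation looks something like this (a hedged sketch; the names and the shared file path are mine). Both analyzers reference the same synonym filter, so the rule `&,and` in the shared file makes analyzer_a fail to build, and the whole index creation throws even though analyzer_b would have loaded it fine:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "shared_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "analyzer": {
        "analyzer_a": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shared_synonyms"]
        },
        "analyzer_b": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "shared_synonyms"]
        }
      }
    }
  }
}
```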
I just checked, and we have over 40 analyzers and 2,300 synonyms. I can't just go in and whack away at these synonyms; I can't even tell you how many combinations fail.
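Until something better exists, one stopgap I can imagine (my own rough sketch, not an official tool) is pre-filtering the shared synonyms file per analyzer before shipping it: drop any rule containing a term that the analyzer's tokenizer would discard or its stop list would remove. The word-boundary tokenization below only crudely approximates the standard tokenizer and is purely illustrative:

```python
import re

def rule_terms(rule: str):
    """Split a Solr-format rule ("a, b => c" or "a, b, c") into its terms."""
    terms = []
    for side in rule.split("=>"):
        terms.extend(t.strip() for t in side.split(",") if t.strip())
    return terms

def survives(term: str, stopwords: set, token_pattern: str = r"\w+") -> bool:
    """True if analyzing the term leaves no holes: every whitespace-separated
    word must yield at least one token, and none may be a stopword."""
    for word in term.lower().split():
        tokens = re.findall(token_pattern, word)
        if not tokens or any(t in stopwords for t in tokens):
            return False
    return True

def filter_rules(rules, stopwords, token_pattern: str = r"\w+"):
    """Keep only rules whose every term survives the (approximated) chain."""
    kept = []
    for rule in rules:
        line = rule.strip()
        if not line or line.startswith("#"):
            continue
        if all(survives(t, stopwords, token_pattern) for t in rule_terms(line)):
            kept.append(line)
    return kept

rules = ["svenska för invandrare, sfi", "&,and", "usa, united states"]
# "för" is stopped (hole) and "&" tokenizes to nothing, so only one rule survives:
print(filter_rules(rules, {"för"}))  # -> ['usa, united states']
```

For a whitespace-tokenizer analyzer like B above, you would pass `token_pattern=r"\S+"` so that `&` survives, and each analyzer would get its own filtered copy of the file.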
As a workaround, do you think it's a matter of just creating a custom synonym plugin that extends the existing synonym token filter and catches and ignores the exception?
Yes, you can create a custom synonym filter and rewrite the SolrSynonymParser. That said, I get your point about being able to ignore rules that throw an exception. Can you open a feature request on GitHub so it can be discussed more widely?
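An untested sketch of what such a parser might look like, based on my reading of the Lucene API (class and method signatures may need adjusting for your Lucene version, and this is not an official implementation): catch the IllegalArgumentException from rule analysis and drop the offending rule instead of propagating.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

// Hedged sketch: a "lenient" parser that skips rules the analysis chain
// mangles (holes from stop filters, multi-form expansions) instead of
// failing the whole synonym map load.
public class LenientSolrSynonymParser extends SolrSynonymParser {

    public LenientSolrSynonymParser(boolean dedup, boolean expand, Analyzer analyzer) {
        super(dedup, expand, analyzer);
    }

    @Override
    public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
        try {
            return super.analyze(text, reuse);
        } catch (IllegalArgumentException e) {
            // The rule analyzed to a hole or an unexpected form; return an
            // empty ref so add() below can drop the rule silently.
            return new CharsRef("");
        }
    }

    @Override
    public void add(CharsRef input, CharsRef output, boolean includeOrig) {
        // Skip any rule where either side was invalidated above.
        if (input.length > 0 && output.length > 0) {
            super.add(input, output, includeOrig);
        }
    }
}
```

You would then wire this parser into a custom token filter factory in your plugin in place of the stock synonym filter's parser.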