HyphenationCompoundWordTokenFilterFactory inherits from AbstractCompoundWordTokenFilterFactory, which performs a mandatory check for a supplied word_list.
Since the underlying Lucene HyphenationCompoundWordTokenFilter does not require a word_list, is there a specific reason why it must be supplied in Elasticsearch?
In my use case, I'd like to avoid specifying in advance all possible matching subwords.
OK, I managed to work around this by creating a custom analysis plugin that instantiates the HyphenationCompoundWordTokenFilter without a word list.
However, when applying the decompounder at both index and search time, I got unexpected results: at query time, all decompounded subwords seem to be treated as synonyms, so documents containing just one subword get the same score as documents containing more(?).
I expected the token to be split into multiple terms that are scored individually, so that documents containing several of them are ranked higher. This is the same expectation that @singer had in #11749, I suppose.
The explain results seem to indicate that every subword is treated as a synonym for the complete compound word(?):
> "description" : "weight(Synonym(collector.default:scherenbosteler collector.default:scherenbostelerstrasse collector.default:strasse) in 912) [PerFieldSimilarity], result of:",
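My current guess, which I have not verified against the Lucene source, is that the decompounder emits the subwords at the same position as the original token (position increment 0), and that the query builder then combines all tokens sharing a position into a single synonym query. The Python sketch below only illustrates that grouping idea. It is not Lucene code; the function names and the `(term, position_increment)` token shape are made up for the illustration.

```python
def group_by_position(tokens):
    """Group (term, position_increment) pairs into per-position term lists.

    A token with increment 0 is stacked onto the previous position;
    any other increment starts a new position.
    """
    groups = []
    for term, pos_inc in tokens:
        if pos_inc == 0 and groups:
            groups[-1].append(term)
        else:
            groups.append([term])
    return groups


def describe_query(field, tokens):
    """Render the grouped tokens roughly the way an explain output names them."""
    parts = []
    for terms in group_by_position(tokens):
        qualified = [f"{field}:{t}" for t in sorted(terms)]
        if len(qualified) == 1:
            parts.append(qualified[0])
        else:
            # Several terms at one position are scored as one synonym unit.
            parts.append("Synonym(" + " ".join(qualified) + ")")
    return " ".join(parts)


# Assumed token stream: the compound first, then its subwords stacked on it.
stacked = [("scherenbostelerstrasse", 1), ("scherenbosteler", 0), ("strasse", 0)]
print(describe_query("collector.default", stacked))

# The same terms with increment 1 each would stay separate, individually scored terms.
separate = [("scherenbostelerstrasse", 1), ("scherenbosteler", 1), ("strasse", 1)]
print(describe_query("collector.default", separate))
```

If this guess is right, the subwords would have to end up at distinct positions (or in separate query clauses) to be scored individually.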
How could I change this behaviour?
See also this Stack Overflow question, in case you can provide an explanation for weight(Synonym()).