Why does hyphenation_decompounder require word_list?

OK, I managed to work around this by writing a custom analysis plugin that creates the HyphenationCompoundWordTokenFilter without a word list.

However, when applying the decompounder at both index and search time, I got unexpected results: at query time, all decompounded subwords seem to be treated as synonyms, so documents containing just one subword get the same score as documents containing more(?).

I expected the token to be split into multiple terms that are scored individually, so that documents containing both of them are ranked higher. This is the same expectation @singer had in #11749, I suppose.

The explain results seem to indicate that every subword is treated as a synonym for the complete compound word(?):

> "description" : "weight(Synonym(collector.default:scherenbosteler collector.default:scherenbostelerstrasse collector.default:strasse) in 912) [PerFieldSimilarity], result of:",
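To illustrate what I think is happening, here is a minimal Python sketch. It is not the actual Lucene code. It assumes the decompounder emits each subword with a position increment of 0, so the subwords stack on the same position as the original token, and that the query builder wraps all terms sharing one position in a single synonym clause. The token stream and the helper names `group_by_position` and `describe_query` are hypothetical.

```python
# Hypothetical token stream as (term, position_increment) pairs.
# Assumption: subwords get increment 0, i.e. the same position as the compound.
tokens = [
    ("scherenbostelerstrasse", 1),
    ("scherenbosteler", 0),
    ("strasse", 0),
]


def group_by_position(tokens):
    """Collect terms that share a position into one group."""
    groups = []
    for term, pos_inc in tokens:
        if pos_inc > 0 or not groups:
            # A positive increment starts a new position.
            groups.append([term])
        else:
            # Increment 0 stacks the term on the previous position.
            groups[-1].append(term)
    return groups


def describe_query(field, tokens):
    """Render a query description: one synonym clause per stacked position."""
    parts = []
    for group in group_by_position(tokens):
        # Sorting only mirrors the term order seen in the explain output.
        terms = " ".join(f"{field}:{t}" for t in sorted(group))
        parts.append(f"Synonym({terms})" if len(group) > 1 else terms)
    return " ".join(parts)


print(describe_query("collector.default", tokens))
```

If this assumption holds, it would explain why a document matching any one of the three terms is scored as if it matched a single term, instead of the subwords contributing separate scores.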

How could I change this behaviour?

See also this Stack Overflow question, in case you can provide an explanation for weight(Synonym()).
