Why does hyphenation_decompounder require word_list?

hbruch · January 8, 2018, 4:56pm

HyphenationCompoundWordTokenFilterFactory inherits from AbstractCompoundWordTokenFilterFactory, which performs a mandatory check for a supplied word_list.

As the underlying Lucene HyphenationCompoundWordTokenFilter does not require a word_list, is there a specific reason why it must be supplied in Elasticsearch?

In my use case, I'd like to avoid specifying in advance all possible matching subwords.

Regards,
Holger

hbruch · January 13, 2018, 2:24pm

OK, I managed to work around this by creating a custom analysis plugin that creates the HyphenationCompoundWordTokenFilter without a word list.

However, when applying the decompounder at both index and search time, I got unexpected results: at query time, all decompounded subwords seem to be treated as synonyms, so documents containing just one subword get the same score as documents containing more(?).

I expected the token to be split into multiple terms that are scored individually, so that documents containing both of them are ranked higher. This is the same expectation @singer had in #11749, I suppose.

The explain results seem to indicate that every subword is treated as a synonym for the complete compound word(?):

> "description" : "weight(Synonym(collector.default:scherenbosteler collector.default:scherenbostelerstrasse collector.default:strasse) in 912) [PerFieldSimilarity], result of:",

How could I change this behaviour?

See also this Stack Overflow question, in case you can provide an explanation for weight(Synonym()).

hbruch · January 13, 2018, 4:41pm

It seems that reusing the original start/end offsets leads the QueryBuilder to build a SynonymQuery, as it collects all terms with a zero position increment in the uncleared currentQuery.
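
To illustrate the grouping described above, here is a minimal sketch in plain Java. It is a simplified model, not Lucene's actual QueryBuilder code: the class PositionGrouping, its Token type, and the Term(...)/Synonym(...) output strings are hypothetical names invented for this example. It only assumes that terms arriving with a position increment of zero are collected into the same group as the preceding term, and that a group with more than one term ends up as a single synonym clause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of position-based term grouping.
// This is NOT Lucene code; it only mimics the behaviour described in the post.
class PositionGrouping {

    /** Stand-in for an analyzed token: its text plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Starts a new group whenever the position advances; otherwise appends to the current group. */
    static List<List<String>> groupByPosition(List<Token> tokens) {
        List<List<String>> groups = new ArrayList<>();
        for (Token token : tokens) {
            if (token.positionIncrement > 0 || groups.isEmpty()) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(token.term);
        }
        return groups;
    }

    /** Renders each group as one clause: a single term, or a synonym clause for several terms. */
    static String describe(List<List<String>> groups) {
        List<String> clauses = new ArrayList<>();
        for (List<String> group : groups) {
            clauses.add(group.size() == 1
                    ? "Term(" + group.get(0) + ")"
                    : "Synonym(" + String.join(" ", group) + ")");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Decompounder-style output: subwords stacked on the original token's position.
        List<Token> stacked = Arrays.asList(
                new Token("scherenbostelerstrasse", 1),
                new Token("scherenbosteler", 0),
                new Token("strasse", 0));
        System.out.println(describe(groupByPosition(stacked)));

        // The same subwords as separate positions would yield individually scored clauses.
        List<Token> separate = Arrays.asList(
                new Token("scherenbosteler", 1),
                new Token("strasse", 1));
        System.out.println(describe(groupByPosition(separate)));
    }
}
```

In this model, the stacked tokens collapse into one synonym clause, which matches the shape of the explain output quoted earlier, while tokens at distinct positions stay separate clauses that could each contribute to the score.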

system · February 10, 2018, 4:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
