Hi everyone,
We are trying to improve searches on our website for queries that contain compound words. For that purpose we are using the hyphenation_decompounder token filter. During testing, we found that the only_longest_match option doesn't work as intended.
To demonstrate, we ran the _analyze API with a reduced configuration. Our real word list is much longer, but I have kept only the words relevant to this example.
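For context, in the actual index the filter is wired into a custom analyzer roughly like this (a simplified sketch: the index, filter, and analyzer names here are placeholders, and in the real setup the word list is much longer; only the filter options match our configuration):
PUT compound_test
{
  "settings": {
    "analysis": {
      "filter": {
        "decompounder_de": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "german_hyphenation_patterns.xml",
          "only_longest_match": true,
          "max_subword_size": 22,
          "word_list": ["kinder", "wagen", "gen"]
        }
      },
      "analyzer": {
        "compound_de": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["decompounder_de"]
        }
      }
    }
  }
}
The reduced _analyze call: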
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "german_hyphenation_patterns.xml",
      "only_longest_match": true,
      "max_subword_size": 22,
      "word_list": [
        "kinder",
        "wagen",
        "gen"
      ]
    }
  ],
  "text": "kinderwagen"
}
The result is:
{
  "tokens" : [
    {
      "token" : "kinderwagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "kinder",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}
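For comparison, this is the result we expected, with the shorter overlapping match "gen" suppressed in favor of "wagen" (written by hand to illustrate the expectation, not actual output; compacted to one line per token):
{
  "tokens" : [
    { "token" : "kinderwagen", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 },
    { "token" : "kinder", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 },
    { "token" : "wagen", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 }
  ]
}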
It seems like "gen" is still emitted as a relevant token here, even though the word list also contains the longer match "wagen" and we set only_longest_match to true.
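In case it helps to narrow things down: as a stopgap we can suppress the short fragment by adding a length token filter after the decompounder (a sketch; the min value of 4 is an arbitrary threshold and would also drop legitimate short subwords):
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "german_hyphenation_patterns.xml",
      "only_longest_match": true,
      "max_subword_size": 22,
      "word_list": ["kinder", "wagen", "gen"]
    },
    {
      "type": "length",
      "min": 4
    }
  ],
  "text": "kinderwagen"
}
This removes "gen" from the output, but it only masks the symptom, so we would still like to understand the only_longest_match behavior.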
Any ideas why this happens? Are we missing something in the token filter configuration? Could this be a bug?
Any help is appreciated.