Hello,
I was looking to use the hyphenation decompounder token filter. However, during testing I found that it seems to be ignoring the minimum subword size option.
Here is an example _analyze request I tried (normally I use a text file for the word list):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "min_subword_size": 3,
      "max_subword_size": 22,
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": [
        "nederland",
        "de",
        "woorden",
        "woord",
        "den"
      ]
    }
  ],
  "text": "nederlandsewoorden"
}
And the result:
{
  "tokens" : [
    {
      "token" : "nederlandsewoorden",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "nederland",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "den",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
I set the minimum subword size to 3, but the filter still produces "de". "de" disappears when I set min_subword_size to 4, but then "den" is no longer found either. Is this a bug, or am I misunderstanding something about this filter?
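For completeness, this is roughly the variant I mention above, with only min_subword_size changed to 4 (everything else is the same as the request at the top):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "min_subword_size": 4,
      "max_subword_size": 22,
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": [
        "nederland",
        "de",
        "woorden",
        "woord",
        "den"
      ]
    }
  ],
  "text": "nederlandsewoorden"
}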