Hyphenation token filter seems to ignore minimum subword size

Hello,
I was looking to use the hyphenation decompounder filter. However, during testing I found that it seems to ignore the minimum subword size option.
Here is an example _analyze request I tried (normally I use a text file with a word list):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "min_subword_size": 3,
      "max_subword_size": 22,
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": [
        "nederland",
        "de",
        "woorden",
        "woord",
        "den"
      ]
    }
  ],
  "text": "nederlandsewoorden"
}

And the result:

{
  "tokens" : [
    {
      "token" : "nederlandsewoorden",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "nederland",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "den",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

I set the minimum subword size to 3, but the filter still emits "de", which is only 2 characters long. The token disappears when I set the minimum to 4, but then "den" (3 characters) is no longer found either. Is this a bug, or am I misunderstanding something about this filter?
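In case it helps anyone hitting the same behaviour: a possible workaround (a sketch I have not verified against this exact setup) is to chain the built-in length token filter after the decompounder, so that subwords shorter than 3 characters are dropped regardless of what the decompounder emits:

```json
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": ["nederland", "de", "woorden", "woord", "den"],
      "min_subword_size": 3,
      "max_subword_size": 22
    },
    {
      "type": "length",
      "min": 3
    }
  ],
  "text": "nederlandsewoorden"
}
```

One caveat: the length filter applies to every token in the stream, not just to subwords produced by the decompounder, so a standalone two-character word in the input text would also be removed.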
