Hyphenation token filter seems to ignore minimum subword size

Hello,
I was looking to use the hyphenation decompounder filter. However, during testing I found that it seems to ignore the minimum subword size option.
Here is an example _analyze request I tried (normally I use a text file with a word list):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "min_subword_size": 3,
      "max_subword_size": 22,
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": [
        "nederland",
        "de",
        "woorden",
        "woord",
        "den"
      ]
    }
  ],
  "text": "nederlandsewoorden"
}

And the result:

{
  "tokens" : [
    {
      "token" : "nederlandsewoorden",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "nederland",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "den",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

I set the minimum subword size to 3, but the filter still emits "de", which is only 2 characters long. The token disappears when I set the minimum to 4, but then "den" (3 characters) is no longer found either. Is this a bug, or am I misunderstanding something about this filter?
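In case it helps anyone hitting the same behaviour: a possible workaround (a sketch I have not verified against this exact setup) is to chain the built-in length token filter after the decompounder, so that subwords shorter than 3 characters are dropped regardless of what the decompounder emits:

```json
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyph/nl.xml",
      "word_list": ["nederland", "de", "woorden", "woord", "den"],
      "min_subword_size": 3,
      "max_subword_size": 22
    },
    {
      "type": "length",
      "min": 3
    }
  ],
  "text": "nederlandsewoorden"
}
```

One caveat: the length filter applies to every token in the stream, not just to subwords produced by the decompounder, so a standalone two-character word in the input text would also be removed.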
