Hyphenation decompounder token filter seems to ignore the `only_longest_match` option

Hi everyone,

We are trying to improve searches on our website that involve compound words. For that purpose, we are using the `hyphenation_decompounder` token filter. During testing, we found that the `only_longest_match` option doesn't work as we expected.

Here is the output of the _analyze API. We actually use a much longer word list, but I have singled out only the words relevant to this example.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path" : "german_hyphenation_patterns.xml",
      "only_longest_match": true,
      "max_subword_size": 22,
      "word_list": [
        "kinder",
        "wagen",
        "gen"
      ]
    }
  ],
  "text": "kinderwagen"
}

The result is:

{
  "tokens" : [
    {
      "token" : "kinderwagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "kinder",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}

It seems that "gen" is still emitted as a token in this case, even though the word list contains "wagen" and we set `only_longest_match` to true.

Any ideas why this happens? Are we missing something in the token filter configuration? Could this be a bug?

Any help is appreciated.

Hi,

I found a relevant discussion. Unfortunately, it seems to be known Lucene behaviour. If so, the name `only_longest_match` is a bit confusing, but it is consistent with Lucene's `onlyLongestMatch` flag.
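For anyone landing on this thread later, here is a rough Python sketch of how the flag appears to behave, based on the output above and my reading of the Lucene filter (treat the details as an assumption; the real filter also only considers hyphenation points, while this sketch tries every offset). The point is that `only_longest_match` keeps only the longest dictionary match starting *at each position*, not the longest match overall, which is why "gen" (starting at a different offset than "wagen") still comes through:

```python
def decompose(token, word_list, only_longest_match=True):
    """Sketch of per-offset subword matching.

    For each start offset in the token, collect dictionary words that
    match there; with only_longest_match, keep only the longest match
    *at that offset*. Matches starting at other offsets are unaffected.
    """
    subwords = []
    for start in range(len(token)):
        # dictionary words matching at this offset (shorter than the token)
        matches = [w for w in word_list
                   if token.startswith(w, start) and len(w) < len(token)]
        if not matches:
            continue
        if only_longest_match:
            subwords.append(max(matches, key=len))
        else:
            subwords.extend(matches)
    return subwords

print(decompose("kinderwagen", ["kinder", "wagen", "gen"]))
# "wagen" matches at offset 6, "gen" at offset 8 -> different offsets,
# so only_longest_match suppresses neither: ['kinder', 'wagen', 'gen']
```

The flag only makes a difference when two dictionary words match at the same offset, e.g. `decompose("kinderwagen", ["kinder", "kind"])` keeps just "kinder".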

Thanks for clarifying this!
