Hyphenation decompounder token filter seems to ignore the `only_longest_match` option

Hi everyone,

We are trying to improve searches on our website that involve compound words. For that purpose, we are using the `hyphenation_decompounder` token filter. During testing, we found that the `only_longest_match` option doesn't work as we expected.

Here is the output of the _analyze API. We actually use a much longer word list, but I have singled out only the words relevant to this example.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path" : "german_hyphenation_patterns.xml",
      "only_longest_match": true,
      "max_subword_size": 22,
      "word_list": [
        "kinder",
        "wagen",
        "gen"
      ]
    }
  ],
  "text": "kinderwagen"
}

The result is:

{
  "tokens" : [
    {
      "token" : "kinderwagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "kinder",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wagen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gen",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}

It seems that "gen" is still emitted as a token in this case, even though the word list contains "wagen" and we set `only_longest_match` to true.

Any ideas why this happens? Are we missing something in the token filter configuration? Could this be a bug?

Any help is appreciated.

Hi,

I found a relevant discussion. Unfortunately, it seems to be known Lucene behaviour. If so, the name `only_longest_match` is a bit confusing, but it is consistent with Lucene's `onlyLongestMatch` flag.
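For anyone landing on this thread later, here is a rough Python sketch of how the flag appears to behave, based on the output above and my reading of the Lucene filter (treat the details as an assumption; the real filter also only considers hyphenation points, while this sketch tries every offset). The point is that `only_longest_match` keeps only the longest dictionary match starting *at each position*, not the longest match overall, which is why "gen" (starting at a different offset than "wagen") still comes through:

```python
def decompose(token, word_list, only_longest_match=True):
    """Sketch of per-offset subword matching.

    For each start offset in the token, collect dictionary words that
    match there; with only_longest_match, keep only the longest match
    *at that offset*. Matches starting at other offsets are unaffected.
    """
    subwords = []
    for start in range(len(token)):
        # dictionary words matching at this offset (shorter than the token)
        matches = [w for w in word_list
                   if token.startswith(w, start) and len(w) < len(token)]
        if not matches:
            continue
        if only_longest_match:
            subwords.append(max(matches, key=len))
        else:
            subwords.extend(matches)
    return subwords

print(decompose("kinderwagen", ["kinder", "wagen", "gen"]))
# "wagen" matches at offset 6, "gen" at offset 8 -> different offsets,
# so only_longest_match suppresses neither: ['kinder', 'wagen', 'gen']
```

The flag only makes a difference when two dictionary words match at the same offset, e.g. `decompose("kinderwagen", ["kinder", "kind"])` keeps just "kinder".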

Thanks for clarifying this!
