Compound word token filter with German umlauts

rzlprnft · November 3, 2018, 12:53pm

I have a problem using the compound word token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html) when it comes to German umlauts.

Consider this configuration:

    "analysis" : {
      "filter" : {
        "german_hyphenation_decompounder" : {
          "only_longest_match" : "true",
          "word_list" : [
            "schwarz",
            "kräuter",
            "tee"
          ],
          "type" : "hyphenation_decompounder",
          "hyphenation_patterns_path" : "/usr/share/elasticsearch/config/hyphenation_patterns.de.xml",
          "min_subword_size" : "3"
        }
      }
    }

I'm using the hyphenation patterns linked from the Elastic docs (https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download).

Decompounding works when I analyse schwarztee:

    [root@acff8d2ab551 elasticsearch]# curl -X GET "localhost:9200/development-products/_analyze?pretty" -H 'Content-Type: application/json' -d'
    >  {
    >    "tokenizer": "standard",
    >    "filter": ["german_hyphenation_decompounder"],
    >    "text" : "schwarztee"
    >  }
    > '
    {
      "tokens" : [
        {
          "token" : "schwarztee",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "schwarz",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "tee",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 0
        }
      ]
    }

But it fails when I try to analyse kräutertee (note the umlaut ä):

    [root@acff8d2ab551 elasticsearch]# curl -X GET "localhost:9200/development-products/_analyze?pretty" -H 'Content-Type: application/json' -d'
    >  {
    >    "tokenizer": "standard",
    >    "filter": ["german_hyphenation_decompounder"],
    >    "text" : "kräutertee"
    >  }
    > '
    {
      "tokens" : [
        {
          "token" : "kräutertee",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 0
        }
      ]
    }

I can confirm that every word containing an umlaut fails in the same way (for example anhängerkupplung). Maybe the hyphenation patterns can't handle umlauts? That would be really odd, though, since they are specifically for German. I assume my encoding is right, because the umlauts come back intact when I read the settings back from ES.
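
One thing that "looks right" does not rule out is a Unicode normalization mismatch: ä can be stored as a single code point or as a plus a combining diaeresis, and both render identically. A minimal Python sketch of how that could be checked follows; the helper names `describe` and `same_after_normalization` are made up for illustration, and I have not confirmed that this is the cause here.

```python
import unicodedata

def describe(text):
    """List each code point of `text` with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]

def same_after_normalization(a, b, form="NFC"):
    """Compare two strings after bringing both into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "kr\u00e4uter"      # ä as one precomposed code point
decomposed = "kra\u0308uter"   # a followed by a combining diaeresis

print(composed == decomposed)                          # False
print(same_after_normalization(composed, decomposed))  # True
print(describe(composed)[2])    # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
print(describe(decomposed)[3])  # U+0308 COMBINING DIAERESIS
```

Running `describe` on the word as it appears in the word list and as it is sent to `_analyze` would show whether both use the same form.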

Is there anything I can do to get a deeper understanding of the decompounding process? I didn't find a way to look at just the candidate subwords, without the word list match, and the hyphenation patterns XML is hard to read (I don't understand what the patterns in there are doing, so it's a black box to me). Any explanation or resource is appreciated.
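
My rough mental model, as far as I understand the filter, is that the hyphenation patterns propose break points and only substrings running between two break points are looked up in the word list. The Python sketch below is a simplified model under that assumption, not Lucene's actual code. The function `decompound` and the hand-picked break points are hypothetical, `max_subword_size=15` is assumed as the default, and `only_longest_match` is ignored.

```python
def decompound(word, hyphenation_points, word_list,
               min_subword_size=3, max_subword_size=15):
    """Return word-list entries that lie between two break points of `word`."""
    dictionary = {w.lower() for w in word_list}
    # The start and end of the word always count as boundaries.
    points = sorted({0, len(word), *hyphenation_points})
    subwords = []
    for i, start in enumerate(points):
        for end in points[i + 1:]:
            part = word[start:end]
            if len(part) < min_subword_size:
                continue  # too short, try a later end point
            if len(part) > max_subword_size:
                break     # later end points only make the part longer
            if part.lower() in dictionary:
                subwords.append(part)
    return subwords

word_list = ["schwarz", "kräuter", "tee"]

# Hand-picked break point after "schwarz" (assumed, not computed from the XML).
print(decompound("schwarztee", [len("schwarz")], word_list))  # ['schwarz', 'tee']

# Without any break point inside the word, nothing from the word list can match.
print(decompound("kräutertee", [], word_list))                # []

# With a break point after "kräuter", both parts are found.
print(decompound("kräutertee", [len("kräuter")], word_list))  # ['kräuter', 'tee']
```

If that model is right, a missing break point between kräuter and tee in the output of the patterns would produce exactly the symptom above, even with a correct word list.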

system · December 1, 2018, 12:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
