I have a problem using the Compound word token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html) when it comes to German umlauts.
Consider this configuration:
"analysis" : {
  "filter" : {
    "german_hyphenation_decompounder" : {
      "only_longest_match" : "true",
      "word_list" : [
        "schwarz",
        "kräuter",
        "tee"
      ],
      "type" : "hyphenation_decompounder",
      "hyphenation_patterns_path" : "/usr/share/elasticsearch/config/hyphenation_patterns.de.xml",
      "min_subword_size" : "3"
    }
  }
}
I'm using the hyphenation patterns mentioned in the Elastic docs (https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download).
The decompounding works when I analyse schwarztee:
[root@acff8d2ab551 elasticsearch]# curl -X GET "localhost:9200/development-products/_analyze?pretty" -H 'Content-Type: application/json' -d'
> {
> "tokenizer": "standard",
> "filter": ["german_hyphenation_decompounder"],
> "text" : "schwarztee"
> }
> '
{
  "tokens" : [
    {
      "token" : "schwarztee",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schwarz",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "tee",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
But it fails when I try to analyse kräutertee (note the umlaut ä):
[root@acff8d2ab551 elasticsearch]# curl -X GET "localhost:9200/development-products/_analyze?pretty" -H 'Content-Type: application/json' -d'
> {
> "tokenizer": "standard",
> "filter": ["german_hyphenation_decompounder"],
> "text" : "kräutertee"
> }
> '
{
  "tokens" : [
    {
      "token" : "kräutertee",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
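To compare both words stage by stage, the same request can be sent with "explain": true, which asks the _analyze API to report the tokens after the tokenizer and after each token filter. The following is only a minimal sketch: the host, index name and filter name are copied from the curl calls above, while the helper names build_analyze_body and analyze are made up for illustration.

```python
import json
import urllib.request

# Assumed values, copied from the curl calls above; adjust for your cluster.
ES_URL = "http://localhost:9200"
INDEX = "development-products"


def build_analyze_body(text, explain=True):
    """Build the same _analyze body as above, optionally with explain."""
    return {
        "tokenizer": "standard",
        "filter": ["german_hyphenation_decompounder"],
        "text": text,
        # Ask _analyze to list the tokens produced at each analysis stage.
        "explain": explain,
    }


def analyze(text, explain=True):
    """Send the request to the cluster and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=json.dumps(build_analyze_body(text, explain)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts POST as well as GET
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling analyze("schwarztee") and analyze("kräutertee") against a running cluster and comparing the two responses should show whether the tokens already differ before the decompounder runs.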
I can confirm that no word containing an umlaut gets decompounded (for example anhängerkupplung). Maybe the hyphenation patterns can't handle umlauts? But that would be really weird, because the pattern file is specifically for German. I guess my encoding is right, because the umlauts come back correctly when I read the config from ES.
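Since the encoding is only a guess, one thing that can be checked locally is whether the ä in the word list and in the analysed text is the same code point sequence. An ä can be a single precomposed character (U+00E4) or an a followed by a combining diaeresis (U+0308), and both look identical on screen. This is a small sketch for inspecting a string; the helper name describe is made up for illustration, and it says nothing about how the pattern file itself is encoded.

```python
import unicodedata


def describe(word):
    """Return code points, NFC status and UTF-8 bytes of a word."""
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in word],
        # True if the word is already in composed (NFC) form.
        "is_nfc": unicodedata.normalize("NFC", word) == word,
        "utf8_bytes": " ".join(f"{b:02x}" for b in word.encode("utf-8")),
    }


composed = "kr\u00e4uter"      # ä as one precomposed character
decomposed = "kra\u0308uter"   # a followed by a combining diaeresis

print(describe(composed))
print(describe(decomposed))
print(composed == decomposed)  # False: the two strings differ
```

If the word list entry and the analysed text give different results here, normalizing both to NFC before indexing would be worth trying.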
Is there anything I can do to get a deeper understanding of the decompounding process? I didn't find a way to look at just the candidate subwords before the word list match, and the hyphenation patterns XML is hard to read: I don't understand what the patterns in there are doing, so it is something of a black box. Any explanation or resource would be appreciated.
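For reference, this is how I currently picture the process, written as a toy model: the hyphenator proposes break points, and only substrings between two break points that also appear in the word list are emitted. This is a simplified sketch and not the actual Lucene implementation; the function name decompound is made up, the break points are hand-written assumptions rather than real hyphenator output, and only_longest_match is ignored.

```python
def decompound(word, break_points, word_list, min_subword_size=3):
    """Toy model: keep word list entries that span two break points."""
    # The start and end of the word always count as cut positions.
    points = sorted({0, len(word), *break_points})
    found = []
    for i, start in enumerate(points):
        for end in points[i + 1:]:
            part = word[start:end]
            if len(part) >= min_subword_size and part in word_list:
                found.append(part)
    return found


WORDS = {"schwarz", "kräuter", "tee"}

# Break points below are assumptions chosen by hand for illustration.
print(decompound("schwarztee", [7], WORDS))     # ['schwarz', 'tee']
print(decompound("kräutertee", [4, 7], WORDS))  # ['kräuter', 'tee']
print(decompound("kräutertee", [], WORDS))      # []
```

In this model, a word for which the hyphenator returns no usable break points yields no subwords at all, which would be consistent with the kräutertee output above, but I have no way to confirm that this is what actually happens.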