A token filter works best when it reduces words to a base form that can be indexed, and when the same filter is also applied at search time.
Keeping the original token in the token stream can be achieved with the keyword_repeat token filter. It distorts the word frequencies in the index, so you will have to live with that when you wonder about differing scoring values. You should add the unique filter to avoid duplicate tokens. Be aware that highlighting is not expected to work any more.
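The effect of keyword_repeat and unique is easy to inspect with the _analyze API. This is only a sketch that uses the built-in porter_stem filter to illustrate the mechanics; it is not part of the German setup below, and it assumes an Elasticsearch version that accepts the JSON body form of _analyze:

POST _analyze
{
  "tokenizer" : "standard",
  "filter" : [ "lowercase", "keyword_repeat", "porter_stem", "unique" ],
  "text" : "Running"
}

The response contains both the original token running (the copy protected by keyword_repeat) and the stemmed token run at the same position. If stemming leaves a token unchanged, the two copies are identical and unique removes the duplicate.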
Also, when analyzing German, your approach is not complete. Folding is just one part. German umlauts are also valid in expanded form: ä -> ae, ö -> oe, ü -> ue, and conversely ae -> ä, oe -> ö, and ue -> ü. This umlaut conversion has to be performed in a grammatical context to avoid errors. The Snowball stemmer (German2 variant) is able to do this conversion (see the snowball_german_umlaut filter below).
There is also the ICU normalizer. Normalization is an important step before folding if you don't know how the input text is encoded. It converts characters that might be decomposed into a Unicode normalized form. Unicode does not distinguish between umlaut and diaeresis (trema).
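For example, ö can be encoded precomposed (U+00F6) or decomposed as o followed by a combining diaeresis (U+0308). A quick sketch with the _analyze API, assuming the analysis-icu plugin is installed (the \u0308 escape stands for the combining diaeresis):

POST _analyze
{
  "tokenizer" : "standard",
  "filter" : [ "lowercase", "icu_normalizer", "icu_folding" ],
  "text" : "Ko\u0308ln"
}

Both the decomposed and the precomposed spelling of Köln end up as the single token koln.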
With the correct analyzer, you can index:
Köln -> koln
Koeln -> koln
Koln -> koln
I have added an unstemmed variant; it omits the German word stemming which Snowball performs.
Here is my solution for German:
{
  "index" : {
    "analysis" : {
      "filter" : {
        "snowball_german_umlaut" : {
          "type" : "snowball",
          "name" : "German2"
        }
      },
      "analyzer" : {
        "stemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "snowball_german_umlaut",
            "unique"
          ]
        },
        "unstemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "german_normalize",
            "unique"
          ]
        }
      }
    }
  }
}
The tokenizer hyphen is one of my custom tokenizers; it can preserve compound words written with hyphens (Bindestrichwörter), which are important in the German language. You can also use the default or whitespace tokenizer instead.
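Once these settings are applied to an index (called myindex here purely for illustration), you can check the behaviour with the _analyze API; the JSON body form shown here may differ depending on your Elasticsearch version:

POST myindex/_analyze
{
  "analyzer" : "stemmed",
  "text" : "Köln Koeln Koln"
}

All three spellings produce the token koln; for Koeln, keyword_repeat additionally keeps the unstemmed form koeln at the same position.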