A token filter works best when it reduces words to a base form that can be indexed, and when the same filter is also applied at search time.
Keeping the original token in the token stream can be achieved with the keyword_repeat token filter. It distorts the word frequencies in the index, so you will have to live with that when you wonder about differing scoring values. You should add the unique filter to avoid duplicate tokens. Be aware that highlighting is not expected to work any more.
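The effect of keyword_repeat and unique is easy to inspect with the _analyze API. This is only a sketch that uses the built-in porter_stem filter to illustrate the mechanics; it is not part of the German setup below, and it assumes an Elasticsearch version that accepts the JSON body form of _analyze:

POST _analyze
{
  "tokenizer" : "standard",
  "filter" : [ "lowercase", "keyword_repeat", "porter_stem", "unique" ],
  "text" : "Running"
}

The response contains both the original token running (the copy protected by keyword_repeat) and the stemmed token run at the same position. If stemming leaves a token unchanged, the two copies are identical and unique removes the duplicate.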
Also, when analyzing German, your approach is not complete. Folding is just one part. German umlauts are also valid in expanded form: ä -> ae, ö -> oe, ü -> ue, and conversely ae -> ä, oe -> ö, and ue -> ü. This umlaut conversion has to be performed in a grammatical context to avoid errors. The Snowball stemmer (German2 variant) is able to do this conversion (see the snowball_german_umlaut filter below).
There is also the ICU normalizer. Normalization is an important step before folding if you don't know how the input text is encoded. It converts characters that might be decomposed into a Unicode normalized form. Unicode does not distinguish between umlaut and diaeresis (trema).
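For example, ö can be encoded precomposed (U+00F6) or decomposed as o followed by a combining diaeresis (U+0308). A quick sketch with the _analyze API, assuming the analysis-icu plugin is installed (the \u0308 escape stands for the combining diaeresis):

POST _analyze
{
  "tokenizer" : "standard",
  "filter" : [ "lowercase", "icu_normalizer", "icu_folding" ],
  "text" : "Ko\u0308ln"
}

Both the decomposed and the precomposed spelling of Köln end up as the single token koln.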
With the correct analyzer, you can index:
Köln -> koln
Koeln -> koln
Koln -> koln
I have added an unstemmed variant; it omits the German word stemming which Snowball performs.
Here is my solution for German:
{
  "index" : {
    "analysis" : {
      "filter" : {
        "snowball_german_umlaut" : {
          "type" : "snowball",
          "name" : "German2"
        }
      },
      "analyzer" : {
        "stemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "snowball_german_umlaut",
            "unique"
          ]
        },
        "unstemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "german_normalize",
            "unique"
          ]
        }
      }
    }
  }
}
The tokenizer hyphen is one of my custom tokenizers; it can preserve compound words written with hyphens (Bindestrichwörter), which are important in the German language. You can also use the default or whitespace tokenizer instead.
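Once these settings are applied to an index (called myindex here purely for illustration), you can check the behaviour with the _analyze API; the JSON body form shown here may differ depending on your Elasticsearch version:

POST myindex/_analyze
{
  "analyzer" : "stemmed",
  "text" : "Köln Koeln Koln"
}

All three spellings produce the token koln; for Koeln, keyword_repeat additionally keeps the unstemmed form koeln at the same position.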