U-umlaut search: user name müller is indexed, search fails for müller but succeeds for muller


(nandakumar sedhuraman) #1

I am indexing the user name müller. I can find it with the query muller, but when I query for müller, nothing is returned. Could you please help me with this?

Below is my analyzer configuration.

{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myOwn_index_analyzer" : {
          "tokenizer" : [
            "standard"
          ],
          "filter" : [
            "standard",
            "my_delimiter",
            "lowercase",
            "icu_folding"
          ]
        },
        "myOwn_search_analyzer" : {
          "tokenizer" : [
            "standard"
          ],
          "filter" : [
            "standard",
            "my_delimiter",
            "lowercase",
            "icu_folding"
          ]
        }
      },
      "filter" : {
        "my_delimiter" : {
          "type" : "word_delimiter",
          "generate_word_parts" : true,
          "catenate_words" : true,
          "catenate_numbers" : true,
          "catenate_all" : true,
          "split_on_case_change" : true,
          "preserve_original" : true,
          "split_on_numerics" : true,
          "stem_english_possessive" : true
        }
      }
    }
  }
}


#2

I have a similar issue, and my mapping looks pretty much the same. It looks like the analyzer replaces the umlaut with a plain ASCII character but does not keep the original umlaut form. My use case is to allow searches with or without the special characters, so "ju" and "jü" should both work. I'm trying to use the icu_folding filter as well. Does anyone have any insight or examples of this?

Thanks!


(Jörg Prante) #3

Use german normalizer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalization-tokenfilter.html


#4

Hi Jorg,

Thanks for the response. I'm currently using the icu_folding filter, which seems to take care of the special characters in the same way as the German normalizer you mentioned. But it seems that both of these filters "replace" the special characters with a non-extended ASCII equivalent. So "ß" gets converted to "ss" and "ö" gets converted to "o". That's great, but I want the search to succeed for both forms. So a prefix query of "das" or "daß" should BOTH return the correct document. I was thinking that setting preserve_original to true would keep both tokens, but it doesn't seem to do so. I'm still pretty new to ES, so forgive me if I'm asking simple questions.

Thanks!


(Jörg Prante) #5

A token filter works best when it reduces words to a base form that can be indexed, and when the same token filter is also applied at search time.
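As a rough illustration of why applying the same reduction at index time and at search time matters, here is a minimal Python sketch. It is not Elasticsearch code: `analyze`, `add_document` and `search` are hypothetical names, and the folding step (casefold, then strip combining marks) is only a stand-in for what a filter such as icu_folding does.

```python
import unicodedata
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for an analyzer: casefold, strip diacritics, split on whitespace."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.split()


# Toy inverted index: base-form token -> set of document ids.
index = defaultdict(set)


def add_document(doc_id, text):
    # Index time: store only the reduced (folded) tokens.
    for token in analyze(text):
        index[token].add(doc_id)


def search(query):
    # Search time: reduce the query with the SAME analyze() before the lookup.
    result = None
    for token in analyze(query):
        postings = index.get(token, set())
        result = set(postings) if result is None else result & postings
    return result or set()


add_document(1, "Müller")
print(search("müller"), search("muller"))
```

Because both the stored token and the query end up as "muller", the document is found with or without the umlaut. If the query side skipped the folding step, "müller" would be looked up literally and miss.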

Keeping the original token in the token stream can be achieved with the keyword_repeat token filter. It distorts the term frequencies in the index, so you have to live with scoring values that differ from what you might expect. You should add the unique filter to avoid duplicate tokens. Highlighting will probably no longer work.
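The idea behind keyword_repeat plus unique can be sketched in Python as follows. This is a simplified model, not Lucene code: all function names are hypothetical, `toy_fold` is a made-up four-character replacement table, and the sketch assumes the downstream filter skips tokens flagged as keywords. Whether a particular real filter honors that flag depends on the filter, so it is worth checking the actual output with the _analyze API.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, bool]  # (term, is_keyword)


def keyword_repeat(terms: List[str]) -> List[Token]:
    """Emit every term twice: first flagged as keyword, then unflagged."""
    out: List[Token] = []
    for term in terms:
        out.append((term, True))
        out.append((term, False))
    return out


def keyword_aware(transform: Callable[[str], str], tokens: List[Token]) -> List[Token]:
    """Apply transform only to tokens that are NOT flagged as keyword (assumed behavior)."""
    return [(term, kw) if kw else (transform(term), kw) for term, kw in tokens]


def unique(tokens: List[Token]) -> List[str]:
    """Drop terms that were already seen, keeping first occurrences in order."""
    seen = set()
    out = []
    for term, _ in tokens:
        if term not in seen:
            seen.add(term)
            out.append(term)
    return out


def toy_fold(term: str) -> str:
    # Hypothetical, deliberately tiny folding table for illustration only.
    return term.replace("ü", "u").replace("ö", "o").replace("ä", "a").replace("ß", "ss")


def analyze(terms: List[str]) -> List[str]:
    return unique(keyword_aware(toy_fold, keyword_repeat(terms)))


print(analyze(["müller", "hans"]))  # ['müller', 'muller', 'hans']
```

Terms that the transform leaves unchanged, like "hans", would appear twice without the final unique step, which is why the two filters are used together.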

Also, when analyzing German, your method is not complete. Folding is just one part. German umlauts are also valid in expanded form: ä->ae, ö->oe, ü->ue, and in the other direction ae->ä, oe->ö, and ue->ü. This umlaut conversion has to be performed in a grammar context to avoid errors. The Snowball German2 stemmer is able to do this conversion (see snowball_german_umlaut below).

Also, there is the ICU normalizer. Normalization is an important step before folding if you don't know how the input text is encoded. It converts characters that might be stored in decomposed form into a single Unicode normalization form. Unicode does not distinguish between umlaut and diaeresis (trema).
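To see what "decomposed" means here, a small Python sketch using the standard library's unicodedata module may help. It uses NFC purely for illustration; the icu_normalizer filter is configurable and its normalization form may differ. The helper name `normalize_nfc` is hypothetical.

```python
import unicodedata

composed = "m\u00fcller"     # 'ü' as one code point (U+00FC)
decomposed = "mu\u0308ller"  # 'u' followed by COMBINING DIAERESIS (U+0308)

# Both strings render the same, but they are different code point sequences,
# so a byte-for-byte or term-for-term comparison treats them as different.
print(composed == decomposed)  # False


def normalize_nfc(text: str) -> str:
    """Bring text into Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# After normalization the two spellings are identical.
print(normalize_nfc(composed) == normalize_nfc(decomposed))  # True
```

Without this step, a document containing the decomposed spelling and a query containing the composed one could produce different terms even before any folding is applied.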

With the correct analyzer, all of the following spellings are indexed as the same term:

Köln -> koln
Koeln -> koln
Koln -> koln
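The mapping above can be imitated with a deliberately naive Python sketch. It is not the Snowball German2 algorithm: the function names are hypothetical, and the context-free replacement of ae/oe/ue is exactly the kind of shortcut that causes the errors mentioned earlier.

```python
import unicodedata


def fold(text: str) -> str:
    """Casefold (ß -> ss), decompose, and drop combining marks (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_german_key(text: str) -> str:
    """Map umlaut, expanded and plain spellings to one key. Naive: ignores context."""
    key = text.casefold()
    # Collapse the expanded forms first (ae/oe/ue -> a/o/u), then fold the rest.
    for expanded, base in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        key = key.replace(expanded, base)
    return fold(key)


for word in ("Köln", "Koeln", "Koln"):
    print(word, "->", naive_german_key(word))
```

All three spellings come out as "koln". The same sketch turns "Poet" into "pot", which shows why the conversion needs grammatical context and is better left to a proper stemmer.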

I have added an unstemmed variant, which omits the German word stemming that Snowball performs.

Here is my solution for German:

{
  "index" : {
    "analysis" : {
      "filter" : {
        "snowball_german_umlaut" : {
          "type" : "snowball",
          "name" : "German2"
        }
      },
      "analyzer" : {
        "stemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "snowball_german_umlaut",
            "unique"
          ]
        },
        "unstemmed" : {
          "type" : "custom",
          "tokenizer" : "hyphen",
          "filter" : [
            "lowercase",
            "keyword_repeat",
            "icu_normalizer",
            "icu_folding",
            "german_normalize",
            "unique"
          ]
        }
      }
    }
  }
}

The hyphen tokenizer is one of my custom tokenizers. It can preserve hyphenated compound words (Bindestrichwörter), which are important in the German language. You can also use the standard or whitespace tokenizer instead.
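To give a feel for what "preserving hyphenated words" means, here is a small Python sketch. It is not the implementation of the hyphen tokenizer mentioned above, whose exact behavior is not described in this thread; `HYPHEN_WORD` and `hyphen_aware_tokenize` are hypothetical names, and the regular expression is just one simple way to keep hyphen-joined parts together.

```python
import re

# A word, optionally followed by one or more "-word" groups, kept as one token.
HYPHEN_WORD = re.compile(r"\w+(?:-\w+)*")


def hyphen_aware_tokenize(text: str) -> list:
    """Split text into tokens while keeping hyphenated compounds intact."""
    return HYPHEN_WORD.findall(text)


print(hyphen_aware_tokenize("Die Bindestrich-Wörter bleiben erhalten"))
# ['Die', 'Bindestrich-Wörter', 'bleiben', 'erhalten']
```

A tokenizer that splits on every non-letter character would instead produce "Bindestrich" and "Wörter" as separate tokens, losing the compound.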


#6

Hey, thanks for the info! I think I've got things working well now.

Cheers!

Simi

