Danish special chars (Æ, Ø, Å) are seen as æ == a/ae, ø == o, å == a

Hi All :slight_smile:

I'm facing a problem with my Elasticsearch setup for Magento 2 (Wyomind).

In general, it seems the Danish special characters Æ, Ø and Å are translated to:
Æ: A / Ae
Ø: O
Å: A

This causes the search to return products that don't actually match.

E.g.

  1. If I search for "Nål" it will find "anal" (analytic), because "nal" is seen as the same as "nål" due to Å == A.
  2. If I search for "åle" it will return "male" (because of "ale").

In general, Elasticsearch should only find exact matches and not read the special chars as regular chars.
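
The folding itself is easy to reproduce with the _analyze API (a quick sketch outside my config; I'm guessing the asciifolding filter is what's doing it):

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Nål"
}

This returns the token "nal" instead of "nål".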

Hope you guys can help me out.

Here is the setup of my Elasticsearch index:

{
   "number_of_shards": 1,
   "number_of_replicas": 0,
   "analysis": {
      "analyzer": {
         "std": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "length"]
         },
         "keyword": {
            "tokenizer": "keyword",
            "filter": ["asciifolding", "lowercase"]
         },
         "keyword_prefix": {
            "tokenizer": "keyword",
            "filter": ["asciifolding", "lowercase", "edge_ngram_front"]
         },
         "text_prefix": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "edge_ngram_front"]
         },
         "text_suffix": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "edge_ngram_back"]
         }
      },
      "filter": {
         "edge_ngram_front": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 10,
            "side": "front"
         },
         "edge_ngram_back": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 10,
            "side": "back"
         },
         "length": {
            "type": "length",
            "min": 1
         }
      }
   }
}

Hi @smhoeks,

any reason you are not using the built-in danish analyzer? This would probably be the simplest option.

You can use the analyze API to test your analyzer. E.g.:

PUT /sample-index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "analyzer": {
            "std": {
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "standard",
                  "elision",
                  "asciifolding",
                  "lowercase",
                  "length"
               ]
            },
            "keyword": {
               "tokenizer": "keyword",
               "filter": [
                  "asciifolding",
                  "lowercase"
               ]
            },
            "keyword_prefix": {
               "tokenizer": "keyword",
               "filter": [
                  "asciifolding",
                  "lowercase",
                  "edge_ngram_front"
               ]
            },
            "text_prefix": {
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "standard",
                  "elision",
                  "asciifolding",
                  "lowercase",
                  "edge_ngram_front"
               ]
            },
            "text_suffix": {
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "standard",
                  "elision",
                  "asciifolding",
                  "lowercase",
                  "edge_ngram_back"
               ]
            }
         },
         "filter": {
            "edge_ngram_front": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 10,
               "side": "front"
            },
            "edge_ngram_back": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 10,
               "side": "back"
            },
            "length": {
               "type": "length",
               "min": 1
            }
         }
      }
   }
}
POST /sample-index/_analyze
{
  "analyzer": "std",
  "text":     "Nål"
}

produces:

{
   "tokens": [
      {
         "token": "nal",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

but

POST /sample-index/_analyze
{
  "analyzer": "danish",
  "text":     "Nål"
}

produces:

{
   "tokens": [
      {
         "token": "nål",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

Daniel

Hi Daniel

Thanks for your answer.

I've tried your solution, but maybe I'm missing something here?
Do I need to change it from "std" to "danish" here? (I'm new to Elasticsearch)

"analysis": {
   "analyzer": {
      "std": {

{
   "number_of_shards": 1,
   "number_of_replicas": 0,
   "analysis": {
      "analyzer": {
         "std": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": [
               "standard",
               "danish_stemmer",
               "elision",
               "asciifolding",
               "lowercase",
               "length"
            ]
         },
         "keyword": {
            "tokenizer": "keyword",
            "filter": [
               "asciifolding",
               "lowercase"
            ]
         },
         "danish": {
            "tokenizer": "standard",
            "filter": ["danish_stemmer"]
         },
         "keyword_prefix": {
            "tokenizer": "keyword",
            "filter": [
               "asciifolding",
               "lowercase",
               "edge_ngram_front"
            ]
         },
         "text_prefix": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": [
               "standard",
               "elision",
               "asciifolding",
               "lowercase",
               "edge_ngram_front"
            ]
         },
         "text_suffix": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": [
               "standard",
               "elision",
               "asciifolding",
               "lowercase",
               "edge_ngram_back"
            ]
         }
      },
      "filter": {
         "edge_ngram_front": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 10,
            "side": "front"
         },
         "edge_ngram_back": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 10,
            "side": "back"
         },
         "length": {
            "type": "length",
            "min": 1
         },
         "danish_stemmer": {
            "type": "stemmer",
            "language": "danish"
         }
      }
   }
}

We've also added a Danish libstemmer to the server, but maybe this is not needed?

Hi @smhoeks,

You can set the default analyzer for an index in the index settings:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "danish"
        }
      }
    }
  }
}

Note that it has to have the name "default" to be recognized as the default analyzer for that index. You can also set different analyzers per field. You had quite a lot of customization in your original analyzer; if you need all of that, you probably want to start customizing the danish analyzer.
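
For example, applying the danish analyzer to a single field would look something like this (a sketch; my_index and the name field are just examples, and on older Elasticsearch versions the mapping needs a type level):

PUT my_index
{
   "mappings": {
      "properties": {
         "name": {
            "type": "text",
            "analyzer": "danish"
         }
      }
   }
}

And if you need your edge n-gram customization on top of the Danish handling, you can rebuild the danish analyzer as a custom analyzer and extend its filter chain. A minimal sketch following the documented composition of the built-in analyzer (note that there is no asciifolding in this chain):

PUT /danish-custom
{
   "settings": {
      "analysis": {
         "filter": {
            "danish_stop": {
               "type": "stop",
               "stopwords": "_danish_"
            },
            "danish_stemmer": {
               "type": "stemmer",
               "language": "danish"
            }
         },
         "analyzer": {
            "rebuilt_danish": {
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "danish_stop",
                  "danish_stemmer"
               ]
            }
         }
      }
   }
}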

I don't know what that is, but no third-party libraries are needed. The danish analyzer is already built into Elasticsearch.
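
You can verify this without creating any index at all; the _analyze API can use the built-in analyzer directly against a bare cluster:

POST /_analyze
{
  "analyzer": "danish",
  "text": "Nål"
}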

Daniel

@danielmitterdorfer

Thanks, I got it to work now.
I changed the analyzer from what it was to danish.

I still faced problems after that, but found out that 'asciifolding' was still converting my special chars. After I removed that filter, it all worked.
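
For anyone finding this later, the std analyzer ended up roughly like this (a sketch based on the description above, not my exact config; the important part is that asciifolding is gone):

"std": {
   "tokenizer": "standard",
   "char_filter": "html_strip",
   "filter": [
      "standard",
      "danish_stemmer",
      "elision",
      "lowercase",
      "length"
   ]
}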

Thanks!

Hi @smhoeks,

glad to hear that all is well now. :slight_smile:

Daniel
