Problems with ASCII Folding of Text with Accents


(Dadepo) #1

I have the following analysis set up:

  "asciifolding_analyzer": {
    "char_filter": ["icu_normalizer"],
    "tokenizer": "standard",
    "filter":  [ "lowercase", "asciifolding" ]
  }

but when I try it out using the Analyze API with the word jọ́jọ́ (note the accents both above and below the o), i.e.

GET /testindex/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́

I get the following results:

{
   "tokens": [
      {
         "token": "jójó",
         "start_offset": 0,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

Notice that the name was normalized to jójó (note the accent on top of the O's). I was expecting it to be normalized to jojo. Why is this the case, and any ideas on how to get the ascii-folding to work 'totally'?


(Byron Voorbach) #2

Hmm, I was able to reproduce your issue. It seems an additional normalisation pass is needed to get 'jojo' as a result:

localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́

gives:

{"tokens":[{"token":"jójó","start_offset":0,"end_offset":6,"type":"<ALPHANUM>","position":1}]}

Analysing again with the result from the previous analysis:

localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jójó

gives:

{"tokens":[{"token":"jojo","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1}]}

Not sure why this happens, though.


(Jörg Prante) #3

asciifolding is not aware of the Unicode rules for processing characters, such as combining marks. You should not mix asciifolding and ICU.
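
To illustrate why the first analyzer needed two passes, here is a minimal Python sketch. It is not the Elasticsearch implementation: `toy_analyze` and `TOY_FOLD_TABLE` are hypothetical names, NFKC plus `lower()` only approximates icu_normalizer, and the two-entry lookup table stands in for asciifolding under the assumption that, like the real filter, it maps precomposed letters and leaves combining marks alone.

```python
import unicodedata

# Toy stand-in for asciifolding (assumption): a lookup table of precomposed
# letters only. Combining marks such as U+0301 are not in the table.
TOY_FOLD_TABLE = {"\u1ecd": "o", "\u00f3": "o"}  # ọ -> o, ó -> o


def toy_analyze(text):
    # Rough approximation of icu_normalizer followed by lowercase.
    composed = unicodedata.normalize("NFKC", text).lower()
    # Per-character folding; anything not in the table passes through.
    return "".join(TOY_FOLD_TABLE.get(ch, ch) for ch in composed)


# The word from the question: j, ọ (U+1ECD), combining acute (U+0301), twice.
original = "j\u1ecd\u0301j\u1ecd\u0301"

# Pass 1: Unicode has no single code point for ọ plus acute, so the acute
# stays a separate combining mark. Folding turns ọ into o and the leftover
# mark now attaches to the plain o.
first_pass = toy_analyze(original)

# Pass 2: normalization composes o + U+0301 into ó (U+00F3), which the
# table can finally fold to o.
second_pass = toy_analyze(first_pass)

print(ascii(first_pass))
print(ascii(second_pass))
```

In this sketch the first pass leaves a combining acute after each o, which displays as jójó, and only the second pass reaches jojo, matching what was observed in #2.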

You should use ICU folding like this:

PUT /test
{
    "settings": {
         "index": {
            "analysis": {
               "analyzer": {
                  "my_icu": {
                     "filter": [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding"
                     ],
                     "tokenizer": "standard"
                  }
               }
            }
        }
    }
}

POST /test/_analyze?analyzer=my_icu
jọ́jọ́

gives:

{
  "tokens": [
    {
      "token": "jojo",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
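
For readers who want to see the general idea outside Elasticsearch, the following Python sketch decomposes the text and drops every combining mark. This is only a rough approximation of what a Unicode-aware folding step does, not the icu_folding implementation itself, and `fold_to_ascii` is a hypothetical helper name.

```python
import unicodedata


def fold_to_ascii(text):
    # Decompose so that every accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop all combining marks (non-zero canonical combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# The word from the question: j, ọ (U+1ECD), combining acute (U+0301), twice.
print(fold_to_ascii("j\u1ecd\u0301j\u1ecd\u0301"))
```

Because the marks are removed after decomposition, both the dot below and the acute above disappear in a single pass. Letters that have no Unicode decomposition (for example ø) are not handled by this sketch.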

(Dadepo) #4

@jprante Yes indeed. Thanks!

