Problems with ASCII folding of text with accents

I have the following analysis set up:

  "asciifolding_analyzer": {
    "char_filter": ["icu_normalizer"],
    "tokenizer": "standard",
    "filter":  [ "lowercase", "asciifolding" ]
  }

but when I try it out using the Analyze API with the word jọ́jọ́ (notice the accents both above and below the o), i.e.

GET /testindex/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́

I get the following results:

{
   "tokens": [
      {
         "token": "jójó",
         "start_offset": 0,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

Notice that the name was normalized to jójó (note the accent on top of the o's). I was expecting it to be normalized to jojo. Why is this the case, and is there a way to get the ASCII folding to work 'totally'?

Hmm, I was able to reproduce your issue. It seems an additional normalisation pass is needed to get 'jojo' as the result:

localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́

gives:

{"tokens":[{"token":"jójó","start_offset":0,"end_offset":6,"type":"<ALPHANUM>","position":1}]}

Analysing again with the result from the previous analysis:

localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jójó

gives:

{"tokens":[{"token":"jojo","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1}]}

Not sure why this happens, though.

asciifolding is not aware of the Unicode rules for processing characters (such as combining marks). You should not mix asciifolding and ICU.
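To make the two-pass behaviour above a bit more concrete, here is a rough Python sketch of what is probably going on. It is only a model, not the actual Lucene code: fold_precomposed_only is a hypothetical stand-in for a table-based filter that maps precomposed letters to ASCII but leaves combining marks alone, and plain NFC is used as an approximation of the icu_normalizer char filter (whose default form differs). There is no single precomposed character for o with both a dot below and an acute, so after composition the acute stays behind as a separate combining mark.

```python
import unicodedata

def fold_precomposed_only(text):
    """Hypothetical model of a table-based fold: precomposed letters are
    mapped to their base letter, combining marks are passed through as-is."""
    out = []
    for ch in text:
        if unicodedata.combining(ch):
            out.append(ch)  # no table entry for combining marks: kept
        else:
            # First code point of the canonical decomposition is the base letter.
            out.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(out)

def fold_all_marks(text):
    """Unicode-aware alternative: decompose fully, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# j + o-with-dot-below (U+1ECD) + combining acute (U+0301), twice
word = "j\u1ecd\u0301j\u1ecd\u0301"

# Pass 1: compose, then fold. The dot below is removed, the acute survives.
composed = unicodedata.normalize("NFC", word)
first_pass = fold_precomposed_only(composed)

# Pass 2: composing again merges o + acute into one letter, which the fold can map.
second_pass = fold_precomposed_only(unicodedata.normalize("NFC", first_pass))

print(ascii(first_pass))            # 'jo\u0301jo\u0301'
print(ascii(second_pass))           # 'jojo'
print(ascii(fold_all_marks(word)))  # 'jojo'
```

In this model the first pass leaves o followed by a combining acute, which displays as ó, and only a second normalisation lets the fold finish the job. A fold that understands combining marks gets to jojo in one step.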

You should use ICU folding instead, like this:

PUT /test
{
    "settings": {
         "index": {
            "analysis": {
               "analyzer": {
                  "my_icu": {
                     "filter": [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding"
                     ],
                     "tokenizer": "standard"
                  }
               }
            }
        }
    }
}

POST /test/_analyze?analyzer=my_icu
jọ́jọ́

which returns:

{
  "tokens": [
    {
      "token": "jojo",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

@jprante Yes indeed. Thanks!