ICU Analysis Plugin doesn't normalize some characters the way other languages (PHP, Python) do

Elasticsearch version: 7.10.1
Installed plugins: [analysis-icu, analysis-kuromoji, analysis-nori]

그래비티 and 그래비티 look the same, but when URL-encoded the first is %EA%B7%B8%EB%9E%98%EB%B9%84%ED%8B%B0%20 and the second is %E1%84%80%E1%85%B3%E1%84%85%E1%85%A2%E1%84%87%E1%85%B5%E1%84%90%E1%85%B5. The difference shows up in the length of the encoded result: the first string is four precomposed Hangul syllables, while the second is eight decomposed conjoining Jamo.
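Listing the code points makes the difference concrete. A minimal sketch in Python (the decomposed form is built with NFD here, so an editor cannot silently recompose the literal):

import unicodedata

composed = "\uadf8\ub798\ube44\ud2f0"                # 그래비티 as 4 precomposed syllables (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # the same text as 8 conjoining Jamo

for label, text in [("composed", composed), ("decomposed", decomposed)]:
    print(label, len(text), [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text])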

To investigate this issue, I first ran a test in Python.

Python 3.8.13:

import unicodedata

def normalize(word):
    print(len(word), word)
    normalized_word = unicodedata.normalize('NFKC', word)
    print(len(normalized_word), normalized_word)
    
for word in ['그래비티', '그래비티']:  # decomposed form first, composed form second
    normalize(word)

The output shows that the normalization does solve the problem:

8 그래비티
4 그래비티
4 그래비티
4 그래비티

But when I test the same normalization in Elasticsearch using the ICU normalizer, the result doesn't change.

GET /_analyze
{
    "char_filter": [
        {
            "type": "icu_normalizer",
            "name": "nfkc"
        }
    ],
    "text": "그래비티"
}

The result is the same for both inputs:

For the composed input (그래비티):
{
    "tokens": [
        {
            "token": "그래비티",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        }
    ]
}
For the decomposed input (그래비티):
{
    "tokens": [
        {
            "token": "그래비티",
            "start_offset": 0,
            "end_offset": 8,
            "type": "word",
            "position": 0
        }
    ]
}
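Because the composed and decomposed forms render identically, the only reliable way to tell whether the returned token was normalized is to inspect its code points. A small sketch using the requests library (assuming Elasticsearch is reachable at http://localhost:9200 without authentication; adjust the URL as needed):

import requests

ES_URL = "http://localhost:9200/_analyze"  # assumption: local, unsecured cluster

body = {
    "char_filter": [{"type": "icu_normalizer", "name": "nfkc"}],
    # decomposed 그래비티, written as explicit Jamo escapes
    "text": "\u1100\u1173\u1105\u1162\u1107\u1175\u1110\u1175",
}

response = requests.get(ES_URL, json=body)
response.raise_for_status()

for token in response.json()["tokens"]:
    text = token["token"]
    # 4 code points (U+ADF8 U+B798 U+BE44 U+D2F0) would mean the char filter composed the text;
    # 8 code points would mean the Jamo came back untouched.
    print(len(text), [f"U+{ord(ch):04X}" for ch in text])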
