Elasticsearch version: 7.10.1
Installed plugins: [analysis-icu, analysis-kuromoji, analysis-nori]
그래비티 (precomposed syllables) and 그래비티 (decomposed jamo) look the same, but they are different strings. When URL-encoded, 그래비티 is %EA%B7%B8%EB%9E%98%EB%B9%84%ED%8B%B0%20 and 그래비티 is %E1%84%80%E1%85%B3%E1%84%85%E1%85%A2%E1%84%87%E1%85%B5%E1%84%90%E1%85%B5. The difference shows in the length of the encoded result.
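One way to make the difference visible without relying on how the text renders is to print the code points and the percent-encoding of each form. This is a minimal sketch, assuming the two strings are the precomposed and the canonically decomposed spelling of the same word. The strings are written with escapes so that copy and paste cannot change them, and the names `composed`, `decomposed` and `code_points` are only illustrative.

```python
import unicodedata
from urllib.parse import quote

# Precomposed Hangul syllables (4 code points), written as escapes.
composed = "\uadf8\ub798\ube44\ud2f0"
# Canonical decomposition into conjoining jamo (8 code points).
decomposed = unicodedata.normalize("NFD", composed)


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in (("composed", composed), ("decomposed", decomposed)):
    # quote() percent-encodes the UTF-8 bytes of the string.
    print(label, len(text), code_points(text), quote(text))
```

The percent-encoded strings printed here can be compared directly with the two encodings quoted above.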
To solve this issue, I first tested normalization in Python.
Python 3.8.13
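import unicodedata

def normalize(word):
    # Print the length and text before and after NFKC normalization.
    print(len(word), word)
    normalized_word = unicodedata.normalize('NFKC', word)
    print(len(normalized_word), normalized_word)

# The first word is the decomposed form, the second the precomposed form.
for word in ['그래비티', '그래비티']:
    normalize(word)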
The output shows that this solves the problem: both forms have length 4 after normalization.
8 그래비티
4 그래비티
4 그래비티
4 그래비티
When I run the same test in Elasticsearch, using the icu_normalizer character filter from the analysis-icu plugin, the result doesn't change.
GET /_analyze
{
"char_filter": [
{
"type": "icu_normalizer",
"name": "nfkc"
}
],
"text": "그래비티"
}
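Because both spellings render identically, it is hard to tell which one actually ends up in the request body after it has been pasted into a console. Below is a sketch of building the same `_analyze` body in Python with `json.dumps`, which escapes non-ASCII characters by default, so the exact code points being sent are visible. The helper name `build_analyze_body` is made up for illustration, and sending the body to `GET /_analyze` with an HTTP client is deliberately left out.

```python
import json
import unicodedata


def build_analyze_body(text):
    """Build the _analyze request body shown above as an ASCII-only JSON string."""
    body = {
        "char_filter": [{"type": "icu_normalizer", "name": "nfkc"}],
        "text": text,
    }
    # ensure_ascii defaults to True, so Hangul is written as escape sequences.
    return json.dumps(body)


composed = "\uadf8\ub798\ube44\ud2f0"
decomposed = unicodedata.normalize("NFD", composed)

print(build_analyze_body(composed))
print(build_analyze_body(decomposed))
```

The two printed bodies differ only in the escaped `text` value, which makes it clear which form each request carries.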
The returned token looks the same for both inputs; only end_offset differs (4 vs. 8):
그래비티
{
"tokens": [
{
"token": "그래비티",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}
그래비티
{
"tokens": [
{
"token": "그래비티",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
}
]
}
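Since the two tokens render identically, comparing them by eye cannot show whether the filter changed anything, and offsets in an `_analyze` response generally refer to positions in the original input text rather than to the length of the emitted token. A hedged sketch for checking a returned token by its code points follows. `describe_token` is a hypothetical helper, and the two sample strings are stand-ins built from escapes, not actual Elasticsearch output.

```python
import unicodedata


def describe_token(token):
    """Summarize a token: length, code points, and whether it is already NFKC."""
    return {
        "length": len(token),
        "code_points": [f"U+{ord(ch):04X}" for ch in token],
        # unicodedata.is_normalized is available from Python 3.8.
        "is_nfkc": unicodedata.is_normalized("NFKC", token),
    }


# Stand-in values for the "token" field, not real Elasticsearch output.
composed = "\uadf8\ub798\ube44\ud2f0"
decomposed = unicodedata.normalize("NFD", composed)

for token in (composed, decomposed):
    print(describe_token(token))
```

Passing the `token` value from each response to `describe_token` would show whether the second token still has 8 code points or was composed down to 4.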