dadepo
(Dadepo)
January 4, 2016, 11:53am
1
I have the following analysis set up:
"asciifolding_analyzer": {
"char_filter": ["icu_normalizer"],
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
But when I try it out using the Analyze API with the word jọ́jọ́ (note the accents both above and below each o), i.e.
GET /testindex/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́
I get the following results:
{
"tokens": [
{
"token": "jójó",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Notice that the word was normalized to jójó (the accent on top of each o survives). I was expecting it to be normalized to jojo. Why is this the case, and any ideas on how to get the ASCII folding to work completely?
Hmm, I was able to reproduce your issue. It seems an additional normalisation pass is needed to get 'jojo' as the result:
localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jọ́jọ́
gives:
{"tokens":[{"token":"jójó","start_offset":0,"end_offset":6,"type":"<ALPHANUM>","position":1}]}
Analysing again with the result from the previous analysis:
localhost:9200/test/_analyze?analyzer=asciifolding_analyzer&text=jójó
gives:
{"tokens":[{"token":"jojo","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1}]}
Not sure why this happens, though.
jprante
(Jörg Prante)
January 4, 2016, 1:25pm
3
asciifolding is not aware of Unicode rules to process characters. You should not mix asciifolding and ICU.
You should use ICU folding like this:
PUT /test
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_icu": {
"filter": [
"lowercase",
"icu_normalizer",
"icu_folding"
],
"tokenizer": "standard"
}
}
}
}
}
}
POST /test/_analyze?analyzer=my_icu
jọ́jọ́
{
"tokens": [
{
"token": "jojo",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
]
}
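For anyone wondering why the first pass leaves jójó: one plausible reading (an assumption, not something confirmed in this thread) is that the normalizer composes o + dot below into the single character U+1ECD, but Unicode has no precomposed letter that also carries the acute, so the acute stays behind as a separate combining mark (U+0301). A folder that only maps whole letters then turns U+1ECD into o and passes the mark through untouched. The Python sketch below imitates that with the standard unicodedata module. TOY_FOLD_TABLE, toy_table_fold and strip_marks are hypothetical stand-ins for illustration, not the actual asciifolding or icu_folding implementations.

```python
import unicodedata

# Hypothetical stand-in for a table-driven folder: it only knows whole,
# precomposed letters and has no entry for combining marks.
TOY_FOLD_TABLE = {"\u1ecd": "o", "\u00f3": "o"}


def toy_table_fold(text):
    """Replace known precomposed letters; leave everything else as-is."""
    return "".join(TOY_FOLD_TABLE.get(ch, ch) for ch in text)


def strip_marks(text):
    """Rough approximation of Unicode-aware folding: decompose, drop marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# j, o, COMBINING DOT BELOW, COMBINING ACUTE ACCENT, twice.
word = "jo\u0323\u0301jo\u0323\u0301"

# Composition merges o + dot below, but the acute has nothing to merge into.
composed = unicodedata.normalize("NFKC", word)
print([unicodedata.name(ch) for ch in composed])

# First pass: the table folds U+1ECD to o, the stray acute remains.
first_pass = toy_table_fold(composed)

# Second pass: o + acute now composes to U+00F3, which the table can fold.
second_pass = toy_table_fold(unicodedata.normalize("NFKC", first_pass))

print(ascii(first_pass), ascii(second_pass), ascii(strip_marks(word)))
```

If this reading is right, it would also account for the two-pass behaviour reported above: the second run sees o + acute, which does have a precomposed form, so the table lookup succeeds. Stripping marks after decomposition gets to jojo in a single step.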
dadepo
(Dadepo)
January 4, 2016, 2:55pm
4
@jprante Yes indeed. Thanks!