Hi guys, I am trying to implement Elasticsearch on my website, which has a lot of posts in Serbian. The main problem occurs when people search for words containing our specific Latin letters (š, ć, ž, ...). I solved that with the asciifolding filter, which works great.
But the asciifolding filter converts the letter "đ" to "d", and that doesn't work for me. When people here search for "Đoković", for example, they type "Djokovic", not "Dokovic".
To solve this, I tried adding pattern_replace char filters that replace the letter đ with dj in any word containing it. Below is my index analyzer configuration:
curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "my_ascii_folding", "lowercase"],
          "char_filter" : [
            "small_dj",
            "big_dj"
          ]
        }
      },
      "filter" : {
        "my_ascii_folding" : {
          "type" : "asciifolding",
          "preserve_original" : true
        }
      },
      "char_filter": {
        "small_dj": {
          "type": "pattern_replace",
          "pattern": "(\\S*)(đ)(\\S*)",
          "replacement": "$0 $1dj$3"
        },
        "big_dj": {
          "type": "pattern_replace",
          "pattern": "(\\S*)(Đ)(\\S*)",
          "replacement": "$0 $1Dj$3"
        }
      }
    }
  }
}';
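To illustrate what I expect the two char filters to do to the text before it reaches the tokenizer, here is a rough Python emulation of the replacement rules. It is only a sketch: Python's `re` module stands in for the Java regex engine that Elasticsearch uses, `\g<0>` plays the role of `$0`, and the function name `apply_dj_char_filters` is made up for this example.

```python
import re

# Hypothetical helper (not part of Elasticsearch): emulates the
# "small_dj" and "big_dj" pattern_replace char filters from the config.
def apply_dj_char_filters(text: str) -> str:
    # Java-style "$0 $1dj$3": keep the original word, then append a copy
    # of it in which the matched đ is replaced by "dj".
    text = re.sub(r"(\S*)(đ)(\S*)", r"\g<0> \1dj\3", text)
    # Same rule for the uppercase letter, producing "Dj".
    text = re.sub(r"(\S*)(Đ)(\S*)", r"\g<0> \1Dj\3", text)
    return text

print(apply_dj_char_filters("đoković"))  # đoković djoković
print(apply_dj_char_filters("Đoković"))  # Đoković Djoković
```

So under these assumptions the tokenizer receives two whitespace-separated words instead of one: the original and the "dj" variant.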
When I analyze the string "đoković" with this index, I get all the expected tokens:
"tokens" : [
  {
    "token" : "dokovic",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 0
  },
  {
    "token" : "đoković",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 0
  },
  {
    "token" : "djokovic",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  },
  {
    "token" : "djoković",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }
]
This seems OK, but when I try to search, I get the following results:
- Djokovic - FOUND
- Đoković - NOT FOUND
Why can't I find anything when I type Đoković? The token is there ...