Serbian analyzer setup

Hi guys, I am trying to implement Elasticsearch on my website, which has a lot of posts in the Serbian language. The main problem occurs when people search for words containing our specific Latin letters (šćž ...). I figured out how to handle that with the asciifolding filter (it works great).

But the asciifolding filter translates the letter "đ" to "d", and that doesn't work for me. When people here search for, say, "Đoković", they type Djokovic, not Dokovic.

To solve this issue I tried to set up a pattern replace char filter to replace words that contain the letter đ with a dj variant. Below is my index analyzer configuration:

curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "my_ascii_folding", "lowercase"],
                    "char_filter" : [
                      "small_dj",
                      "big_dj"
                    ]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            },
            "char_filter": {
              "small_dj": {
                "type": "pattern_replace",
                "pattern": "(\\S*)(đ)(\\S*)",
                "replacement": "$0 $1dj$3"
              },
              "big_dj": {
                "type": "pattern_replace",
                "pattern": "(\\S*)(Đ)(\\S*)",
                "replacement": "$0 $1Dj$3"
              }
            }
        }
    }
}';

When I analyze the string "đoković" with this analyzer, I do get all the tokens (the char filter first rewrites the input to "đoković djoković", and asciifolding with preserve_original then yields four tokens):

"tokens" : [
    {
      "token" : "dokovic",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "đoković",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "djokovic",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "djoković",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
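
For reference, an _analyze request along these lines reproduces the tokens above (just a sketch, pointing at the index's default analyzer):

curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
    "analyzer": "default",
    "text": "đoković"
}
'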

This seems OK, so when I try to search I get the following results:

  1. Djokovic - FOUND
  2. Đoković - NOT FOUND

Why can't I find it when typing Đoković? The token is there ...

What is your mapping and what kind of query? Perhaps something is not being
mapped properly.

Letters such as 'Đ' should be supported by the ASCII folding filter.

Živjeli,

Ivan

Hi Ivan (Pozdrav :smiley:)

The letter Đ is supported by the asciifolding filter, but it is translated to the letter d, and that is not what I want. I want it translated to dj, because that's what users will search for, right?

I could use a mapping char filter, but I also want to keep the original (like the asciifolding filter does).
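
(For reference, the mapping char filter variant I mean would look roughly like this. It is only a sketch: the index name is made up, and unlike my current setup it rewrites đ outright without keeping the original.)

curl -XPUT 'localhost:9200/my_index_mapped' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "asciifolding"],
                    "char_filter" : ["serbian_dj"]
                }
            },
            "char_filter" : {
                "serbian_dj" : {
                    "type" : "mapping",
                    "mappings" : ["đ => dj", "Đ => Dj"]
                }
            }
        }
    }
}'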

Here is the mapping of my_index:

{
    "my_index": {
        "mappings": {
            "post": {
                "properties": {
                    "id": {
                        "type": "long"
                    },
                    "title": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

And here is the query (curl):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "djokovic*"
        }
    }
}
'

result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30138126,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "post",
        "_id" : "1",
        "_score" : 0.30138126,
        "_source" : {
          "id" : 1,
          "title" : "Novak Đoković ponovo prvak mastersa"
        }
      }
    ]
  }
}

But when I search with "đ" (as many users would do):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "đokovic*"
        }
    }
}
'

I get no results:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

The original JSON was not formatted, so I missed the fact that you are
setting the default analyzer. So it should not be a mapping issue.

Searching for "đokovic" does return the correct result, but you are also applying a wildcard ("đokovic*"). Enable analyze_wildcard on the query_string query to allow the term to go through the analysis process.
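
Something along these lines should work (a sketch of your query with the flag added):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "đokovic*",
            "analyze_wildcard": true
        }
    }
}
'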

I would also suggest looking into ICU analysis [1] since it goes beyond
basic ASCII folding. ASCII folding is good for removing accents, but not
for Serbian Latin since characters like đ are not simply d with an accent,
but a whole other letter.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
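
A rough sketch of the ICU pieces, assuming the analysis-icu plugin is installed (the index name is made up, and note that icu_folding on its own may still fold đ to plain d, so you would likely keep your dj char filter in front of it):

bin/elasticsearch-plugin install analysis-icu

curl -XPUT 'localhost:9200/icu_test' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}'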

Cheers,

Ivan
