Serbian analyzer setup

Hi guys, I am trying to implement Elasticsearch on my website, which has a lot of posts in the Serbian language. The main problem occurs when people search for words containing our specific Latin letters (šćž ...). I figured out how to handle that with the asciifolding filter (it works great).

But the asciifolding filter translates the letter "đ" to "d", and that doesn't work for me. When people here search for, say, "Đoković", they type Djokovic, not Dokovic.

To solve this issue I tried to set up a pattern replace char filter to replace words that contain the letter đ with a dj variant. Below is my index analyzer configuration:

curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "my_ascii_folding", "lowercase"],
                    "char_filter" : [
                      "small_dj",
                      "big_dj"
                    ]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            },
            "char_filter": {
              "small_dj": {
                "type": "pattern_replace",
                "pattern": "(\\S*)(đ)(\\S*)",
                "replacement": "$0 $1dj$3"
              },
              "big_dj": {
                "type": "pattern_replace",
                "pattern": "(\\S*)(Đ)(\\S*)",
                "replacement": "$0 $1Dj$3"
              }
            }
        }
    }
}';

When I analyze the string "đoković" with this analyzer, I do get all the tokens (the char filter first rewrites the input to "đoković djoković", and asciifolding with preserve_original then yields four tokens):

"tokens" : [
    {
      "token" : "dokovic",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "đoković",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "djokovic",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "djoković",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
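
For reference, an _analyze request along these lines reproduces the tokens above (just a sketch, pointing at the index's default analyzer):

curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
    "analyzer": "default",
    "text": "đoković"
}
'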

This seems OK, so when I try to search I get the following results:

  1. Djokovic - FOUND
  2. Đoković - NOT FOUND

Why can't I find it when typing Đoković? The token is there ...

What is your mapping and what kind of query? Perhaps something is not being
mapped properly.

Letters such as 'Đ' should be supported by the ASCII folding filter.

Živjeli,

Ivan

Hi Ivan (Pozdrav :smiley:)

The letter Đ is supported by the asciifolding filter, but it is translated to the letter d, and that is not what I want. I want it translated to dj, because that's what users will search for, right?

I could use a mapping char filter, but I also want to keep the original (like the asciifolding filter does).
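
(For reference, the mapping char filter variant I mean would look roughly like this. It is only a sketch: the index name is made up, and unlike my current setup it rewrites đ outright without keeping the original.)

curl -XPUT 'localhost:9200/my_index_mapped' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "asciifolding"],
                    "char_filter" : ["serbian_dj"]
                }
            },
            "char_filter" : {
                "serbian_dj" : {
                    "type" : "mapping",
                    "mappings" : ["đ => dj", "Đ => Dj"]
                }
            }
        }
    }
}'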

Here is the mapping of my_index:

{
    "my_index": {
        "mappings": {
            "post": {
                "properties": {
                    "id": {
                        "type": "long"
                    },
                    "title": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

And here is the query (curl):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "djokovic*"
        }
    }
}
'

result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30138126,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "post",
        "_id" : "1",
        "_score" : 0.30138126,
        "_source" : {
          "id" : 1,
          "title" : "Novak Đoković ponovo prvak mastersa"
        }
      }
    ]
  }
}

But when I search with "đ" (as many users would do):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "đokovic*"
        }
    }
}
'

I get no results:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

The original JSON was not formatted, so I missed the fact that you are
setting the default analyzer. So it should not be a mapping issue.

Searching for "đokovic" does return the correct result, but you are also applying a wildcard ("đokovic*"). Enable analyze_wildcard on the query_string query to allow the term to go through the analysis process.
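
Something along these lines should work (a sketch of your query with the flag added):

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "đokovic*",
            "analyze_wildcard": true
        }
    }
}
'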

I would also suggest looking into ICU analysis [1] since it goes beyond
basic ASCII folding. ASCII folding is good for removing accents, but not
for Serbian Latin since characters like đ are not simply d with an accent,
but a whole other letter.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
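
A rough sketch of the ICU pieces, assuming the analysis-icu plugin is installed (the index name is made up, and note that icu_folding on its own may still fold đ to plain d, so you would likely keep your dj char filter in front of it):

bin/elasticsearch-plugin install analysis-icu

curl -XPUT 'localhost:9200/icu_test' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}'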

Cheers,

Ivan
