AutoCompletion for Hindi (Indian Language)

Hello,
I am trying to use the edge n-gram tokenizer for autocomplete search. Below is my index definition:

PUT hindi_test
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type":       "stop",
          "stopwords":  "_hindi_" 
        },
        "hindi_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["उदाहरण"] 
        },
        "hindi_stemmer": {
          "type":       "stemmer",
          "language":   "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter","digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "hindi",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

However, when I run the analyzer on a sample word as follows:

POST hindi_test/_analyze
{
  "analyzer": "hindi",
  "text": "डीसील्वा"
}

My output is:

{
  "tokens" : [
    {
      "token" : "ड",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "स",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ल",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "व",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}

My assumption is that the diacritics and conjuncts are being lost during tokenization. Any pointers on what I am missing in my index definition?
Any help is appreciated.
Thanks in advance!

This is just a guess, since I have no experience with Hindi: try removing parts of your analysis settings, such as the stemmer and stopword filters, and see if the problem persists.
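
One more thing that might be worth checking, offered as a sketch rather than a confirmed diagnosis: the offsets in your output (0, 2, 4, 6) suggest that every second character is being skipped. The Python snippet below is independent of Elasticsearch. It prints the Unicode category of each character in your sample text and then splits the text on anything that is not in a letter category. It assumes that the letter class in token_chars follows the Unicode letter categories, which I have not verified against the Elasticsearch source. The helper name split_on_non_letters is made up for illustration.

```python
import unicodedata

# Sample word from the _analyze request above.
text = "डीसील्वा"


def split_on_non_letters(s):
    """Return runs of consecutive characters whose Unicode category starts with 'L'.

    Assumption: this approximates a tokenizer that keeps only "letter"
    characters and treats everything else as a token boundary.
    """
    runs, current = [], ""
    for ch in s:
        if unicodedata.category(ch).startswith("L"):
            current += ch
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


# Show how Unicode classifies each code point in the sample.
for index, ch in enumerate(text):
    print(index, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

runs = split_on_non_letters(text)
print(runs)  # ['ड', 'स', 'ल', 'व']
```

The vowel signs and the virama are in the mark categories (Mc and Mn) rather than a letter category, so under the assumption above each consonant ends up as its own one-character run. That would match the four tokens in your output.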

Also, if you are using Elasticsearch 7.2 or later, you may want to take a look at the search-as-you-type datatype.
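
For reference, here is a minimal sketch of what a search-as-you-type setup could look like, written as Python that only prints request bodies to paste into the console. The index name hindi_test_sayt and the helper autocomplete_query are hypothetical. The multi_match query with type bool_prefix over the field and its _2gram and _3gram subfields follows the pattern shown in the Elasticsearch documentation for this field type. No analyzer is set, so the field would use the default one; whether that suits your data is something to test.

```python
import json

# Index body with a single search_as_you_type field.
# The index name used below (hindi_test_sayt) is a placeholder.
index_body = {
    "mappings": {
        "properties": {
            "name": {"type": "search_as_you_type"}
        }
    }
}


def autocomplete_query(user_input, field="name"):
    """Build a bool_prefix multi_match query over the field and its shingle subfields."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


# Print console-style requests; ensure_ascii=False keeps the Hindi text readable.
print("PUT hindi_test_sayt")
print(json.dumps(index_body, ensure_ascii=False, indent=2))
print("GET hindi_test_sayt/_search")
print(json.dumps(autocomplete_query("डी"), ensure_ascii=False, indent=2))
```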

--Alex

Thanks @spinscale, the search-as-you-type datatype works fine.
Thanks for your help!
