Hello,
I am trying to use the edge_ngram tokeniser to build autocomplete search for Hindi text. Below is my index definition:
PUT hindi_test
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type": "stop",
          "stopwords": "_hindi_"
        },
        "hindi_keywords": {
          "type": "keyword_marker",
          "keywords": ["उदाहरण"]
        },
        "hindi_stemmer": {
          "type": "stemmer",
          "language": "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "hindi",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
However, when I analyse a word with the following request:
POST hindi_test/_analyze
{
  "analyzer": "hindi",
  "text": "डीसील्वा"
}
the output is:
{
  "tokens": [
    {
      "token": "ड",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "स",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "ल",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "व",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    }
  ]
}
My assumption is that the diacritics (matras) and conjuncts are being lost during tokenisation, since only the bare consonants ड, स, ल, and व survive in the output. Any pointers on what I am missing in my index definition?
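To narrow it down, I believe the tokeniser can also be run on its own through _analyze, which should show whether the characters are already lost before any filter is applied (I expect the same four single-character tokens here, but I haven't confirmed):

POST hindi_test/_analyze
{
  "tokenizer": "autocomplete",
  "text": "डीसील्वा"
}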
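For reference, the variant below is what I was planning to try next: a throwaway index (hindi_test2 is just a placeholder name) where token_chars is left at its default of an empty list, which as far as I understand keeps all characters, combining marks included (at the cost of also keeping whitespace inside the grams). I haven't verified this yet, and I'd still like to understand why the current definition splits where it does:

# hindi_test2 is a scratch index, used only to test the tokeniser change
PUT hindi_test2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase"]
        }
      }
    }
  }
}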
Any help appreciated.
Thanks in advance!