Hello,
I am trying to use the edge_ngram tokeniser to build autocomplete search for Hindi text. Below is my index definition:
PUT hindi_test
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type": "stop",
          "stopwords": "_hindi_"
        },
        "hindi_keywords": {
          "type": "keyword_marker",
          "keywords": ["उदाहरण"]
        },
        "hindi_stemmer": {
          "type": "stemmer",
          "language": "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "hindi",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
However, when I analyse a word with the following request:
POST hindi_test/_analyze
{
  "analyzer": "hindi",
  "text": "डीसील्वा"
}
the output is:
{
  "tokens": [
    {
      "token": "ड",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "स",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "ल",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "व",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    }
  ]
}
My assumption is that the diacritics (matras) and conjuncts are being lost during tokenisation, since only the bare consonants ड, स, ल, and व survive in the output. Any pointers on what I am missing in my index definition?
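To narrow it down, I believe the tokeniser can also be run on its own through _analyze, which should show whether the characters are already lost before any filter is applied (I expect the same four single-character tokens here, but I haven't confirmed):

POST hindi_test/_analyze
{
  "tokenizer": "autocomplete",
  "text": "डीसील्वा"
}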
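For reference, the variant below is what I was planning to try next: a throwaway index (hindi_test2 is just a placeholder name) where token_chars is left at its default of an empty list, which as far as I understand keeps all characters, combining marks included (at the cost of also keeping whitespace inside the grams). I haven't verified this yet, and I'd still like to understand why the current definition splits where it does:

# hindi_test2 is a scratch index, used only to test the tokeniser change
PUT hindi_test2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase"]
        }
      }
    }
  }
}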
Any help appreciated.
Thanks in advance!