Hi,
I have a custom analyzer which uses the edge_ngram token filter. Below is the setup:
"analysis": {
  "filter": {
    "my_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 10
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase", "my_filter" ]
    }
  }
}
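For what it's worth, I've been checking the tokens with the _analyze API (my_index is just a placeholder for my index name):

```
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "โจ้ นากา"
}
```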
When I use the above analyzer to index some Thai content, it seems that the edge_ngram filter treats the tone mark as a separate character when it produces the tokens. For example, โจ้ นากา yields the tokens โ, โจ, โจ้, น, นา, นาก, นากา.
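To illustrate what I mean, the behaviour I'm seeing can be reproduced with a quick sketch (plain Python, not Elasticsearch code): since the tone mark ้ (U+0E49) is its own Unicode codepoint, the shorter prefixes simply don't include it.

```python
# Sketch of edge_ngram with min_gram=1, max_gram=10:
# it emits every prefix of each token up to max_gram codepoints.
def edge_ngrams(token, min_gram=1, max_gram=10):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

text = "โจ้ นากา"  # the tone mark ้ (U+0E49) is a separate codepoint
tokens = []
for word in text.split():  # stand-in for the standard tokenizer
    tokens.extend(edge_ngrams(word))

print(tokens)  # ['โ', 'โจ', 'โจ้', 'น', 'นา', 'นาก', 'นากา']
```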
So when someone searches for either โจ้ (with the tone mark) or โจ (without it), that document is returned, which is fine. The problem is that when a user searches for โจ (without the tone mark), some documents that contain โจ้ (with the tone mark) rank higher than documents that contain the exact term โจ. I understand why this happens, but I'm not sure how to solve it. In this case I'd like to give extra score to the documents that contain the exact term. Is this possible?
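I was thinking of something along these lines, where the field is also indexed untouched as a sub-field and an optional should clause boosts exact matches (my_field and my_field.exact are just placeholder names), but I'm not sure this is the right approach:

```
{
  "query": {
    "bool": {
      "must": [
        { "match": { "my_field": "โจ" } }
      ],
      "should": [
        { "match": { "my_field.exact": "โจ" } }
      ]
    }
  }
}
```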
Also, is it possible to keep the edge_ngram filter from splitting between a base character and its tone mark, so that it wouldn't produce the bare token โจ (without the tone mark) in this case?
Any help would be appreciated.