Accent with edge ngram token filter

Hi,

I have a custom analyzer that uses the edge_ngram token filter. Below is the setup:

"analysis": {
        "filter": {
          "my_filter": {
            "type": "edge_ngram",
            "min_gram": "1",
            "max_gram": "10"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "filter": [ "lowercase", "my_filter" ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }

When I use the above analyzer to index some Thai content, it seems that the edge_ngram filter also takes diacritics into account when it produces the tokens. For example:

โจ้ นากา would give โ, โจ, โจ้, น, นา, นาก, นากา. So when someone searches for either โจ้ (with diacritic) or โจ (without diacritic), that document will be returned, which is fine.
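
The token list above can be reproduced outside Elasticsearch. The following is a minimal Python sketch, assuming the filter emits prefixes counted in Unicode code points; `edge_ngrams` is a hypothetical helper, not an Elasticsearch API. It suggests that โจ shows up simply because the diacritic is stored as its own code point after the consonant, so the unaccented form is just a shorter prefix rather than the result of accent folding.

```python
import unicodedata


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of `token` that are min_gram..max_gram code points long.

    Hypothetical helper that mimics an edge n-gram filter on a single token.
    """
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Whitespace split stands in for the tokenizer in this small example.
for word in "โจ้ นากา".split():
    print(word, "->", edge_ngrams(word))

# Inspect the code points of the first word: the diacritic is a separate
# character, which is why a prefix without it exists.
for ch in "โจ้":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

Running the same text through the `_analyze` endpoint with `my_analyzer` would be the authoritative check; this sketch only models the prefix logic.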

The problem is that when the user searches for the term โจ (without diacritic), some documents that contain โจ้ (with diacritic) rank higher than those that contain the exact term โจ. I understand why this is happening, but I'm not sure how to solve it. In this case, I want to give a higher score to the documents that contain the exact term. Is this possible?
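
One possible way to favour exact matches, sketched here in Python as a request body builder: require a match on the n-gram subfield and add an optional `should` clause on a field that is not n-grammed, so documents containing the search term as a whole token pick up extra score. The field names `name` and `name.auto` follow the mapping used in this thread; the helper name `build_search_body` and the boost value are assumptions, and whether `name` actually yields the exact term as a token depends on how its analyzer segments the text.

```python
import json


def build_search_body(term, exact_boost=2.0):
    """Build a bool query: n-gram match is required, exact-token match adds score.

    Hypothetical helper; `exact_boost` is an arbitrary illustrative value.
    """
    return {
        "query": {
            "bool": {
                # Every hit must match the edge n-gram subfield.
                "must": [
                    {"match": {"name.auto": {"query": term}}}
                ],
                # Optional clause: hits that also match the non-n-gram
                # field get a higher score.
                "should": [
                    {"match": {"name": {"query": term, "boost": exact_boost}}}
                ],
            }
        }
    }


print(json.dumps(build_search_body("โจ"), ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of a `_search` request. The boost would need tuning against real data, since the relative scores of the two clauses depend on the index.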

Also, is it possible to disable the accent folding on the edge_ngram filter so that it wouldn't produce the token โจ (without diacritic) in this case?

Any help would be appreciated.

Hi @Bob_Guo

I think you are working with Thai text.
In that case I would try the Thai tokenizer, to preserve accents and segment the documents better.

Hi @RabBit_BR ,

Thanks for your reply.

I'm actually using the Thai analyzer on a different field. But the Thai analyzer by itself isn't good enough, especially when the user searches for only a partial keyword. For example, the Thai analyzer produces the tokens โจโฟน, หมด, สดชื่น for the text โจโฟน ได้หมดถ้าสดชื่น, so when the user searches for โจ, no results are returned. That is why I created this custom analyzer with the edge_ngram filter to improve the search results. Below are the mappings:

{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "thai",
      "fields": {
        "auto": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

I've been trying to find a way to make the edge_ngram filter accent sensitive, but have had no luck. I'm not sure whether this is achievable. I thought this would be a common problem for languages with diacritics, but I couldn't find anything helpful on the internet.
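
To illustrate what an accent-sensitive edge n-gram could look like, here is a Python sketch that keeps each combining mark attached to its base character before building prefixes. It is a conceptual illustration only: `mark_aware_edge_ngrams` is a hypothetical helper and not an Elasticsearch option, and treating every character in a Unicode mark category (`M*`) as part of the preceding cluster is a simplification of full grapheme segmentation.

```python
import unicodedata


def clusters(token):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in token:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the previous base character
        else:
            out.append(ch)
    return out


def mark_aware_edge_ngrams(token, min_gram=1, max_gram=10):
    """Edge n-grams counted in clusters, so a base is never split from its marks.

    Hypothetical helper for illustration; not an Elasticsearch filter.
    """
    parts = clusters(token)
    longest = min(max_gram, len(parts))
    return ["".join(parts[:n]) for n in range(min_gram, longest + 1)]


print(mark_aware_edge_ngrams("โจ้"))
print(mark_aware_edge_ngrams("นากา"))
```

With this grouping, a word ending in a consonant plus diacritic no longer produces the bare-consonant prefix, so a search for the form without the diacritic would not match it through the n-grams. Applying the same idea inside Elasticsearch would need a different mechanism, which this sketch does not cover.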
