Hi,
I have a custom analyzer which uses the edge_ngram token filter. Below is the setup:
"analysis": {
  "filter": {
    "my_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 10
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase", "my_filter" ]
    }
  }
}
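For what it's worth, I've been checking the tokens with the _analyze API (my_index is just a placeholder for my index name):

```
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "โจ้ นากา"
}
```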
When I use the above analyzer to index some Thai content, it seems that the edge_ngram filter treats the tone mark as a separate character when it produces the tokens. For example, โจ้ นากา yields the tokens โ, โจ, โจ้, น, นา, นาก, นากา.
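To illustrate what I mean, the behaviour I'm seeing can be reproduced with a quick sketch (plain Python, not Elasticsearch code): since the tone mark ้ (U+0E49) is its own Unicode codepoint, the shorter prefixes simply don't include it.

```python
# Sketch of edge_ngram with min_gram=1, max_gram=10:
# it emits every prefix of each token up to max_gram codepoints.
def edge_ngrams(token, min_gram=1, max_gram=10):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

text = "โจ้ นากา"  # the tone mark ้ (U+0E49) is a separate codepoint
tokens = []
for word in text.split():  # stand-in for the standard tokenizer
    tokens.extend(edge_ngrams(word))

print(tokens)  # ['โ', 'โจ', 'โจ้', 'น', 'นา', 'นาก', 'นากา']
```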
So when someone searches for either โจ้ (with the tone mark) or โจ (without it), that document is returned, which is fine. The problem is that when a user searches for โจ (without the tone mark), some documents that contain โจ้ (with the tone mark) rank higher than documents that contain the exact term โจ. I understand why this happens, but I'm not sure how to solve it. In this case I'd like to give extra score to the documents that contain the exact term. Is this possible?
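I was thinking of something along these lines, where the field is also indexed untouched as a sub-field and an optional should clause boosts exact matches (my_field and my_field.exact are just placeholder names), but I'm not sure this is the right approach:

```
{
  "query": {
    "bool": {
      "must": [
        { "match": { "my_field": "โจ" } }
      ],
      "should": [
        { "match": { "my_field.exact": "โจ" } }
      ]
    }
  }
}
```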
Also, is it possible to keep the edge_ngram filter from splitting between a base character and its tone mark, so that it wouldn't produce the bare token โจ (without the tone mark) in this case?
Any help would be appreciated.