How to keep full-width digits from being split into single characters

Totsuka · July 31, 2024, 8:55am

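Could you tell me which plugin or setting converts full-width digits to half-width digits and keeps them from being split into single-character tokens?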

□ Symptom

GET kensyo_index/_analyze
{
  "text" : "６０００",
  "analyzer": "my_custom_analyzer"
}

{
  "tokens" : [
    {
      "token" : "６",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "０",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "０",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "０",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    }
  ]
}
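
To narrow down which stage of the analyzer produces the single-character tokens, the same request can be sent with the `explain` option of the analyze API. For a custom analyzer, the response then lists the output of each char filter, the tokenizer, and each token filter separately. A minimal sketch, reusing the index and analyzer names from the request above:

```console
# Same request as above, with "explain" added.
# kensyo_index and my_custom_analyzer are the names used in the original request.
GET kensyo_index/_analyze
{
  "text": "６０００",
  "analyzer": "my_custom_analyzer",
  "explain": true
}
```

If the tokenizer entry in the response already shows four separate tokens, the split happens during tokenization, before any token filter runs.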

□ What I tried

The CJK width token filter (with kuromoji)
CJK width token filter | Elasticsearch Guide [7.17] | Elastic
The ICU normalization character filter, with its default settings
ICU normalization character filter | Elasticsearch Plugins and Integrations [8.14] | Elastic

⇒ Both convert full-width characters to half-width, but the digits are still split into single characters.
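
For reference, a configuration of the kind described above might look like the following. This is only a sketch under stated assumptions: the actual definition of `my_custom_analyzer` is not shown in this post, so the use of `kuromoji_tokenizer`, the combination of both filters in one analyzer, and the index name `kensyo_index_sketch` are all hypothetical. It also assumes the analysis-icu and analysis-kuromoji plugins are installed.

```console
# Assumption: the real my_custom_analyzer definition is not shown in the post.
# kensyo_index_sketch is a hypothetical index name used only for this example.
# icu_normalizer is a char filter, so it runs before the tokenizer.
# cjk_width is a token filter, so it runs after the tokenizer has already split the text.
PUT kensyo_index_sketch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["icu_normalizer"],
          "tokenizer": "kuromoji_tokenizer",
          "filter": ["cjk_width"]
        }
      }
    }
  }
}
```

The distinction matters here because a token filter can only change the width of tokens that already exist, while a char filter changes the text that the tokenizer receives.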

I would appreciate any answers.
Thank you in advance.
