Hello, is it possible to apply an N-gram token filter to each <IDEOGRAPHIC> phrase the same way it already works on <HANGUL> tokens?
For example (with min_gram = max_gram = 2 and preserve_original enabled):
Input: "我 爱 青苹果"
Desired Output: "我", "爱", "青苹", "苹果", and "青苹果"
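For reference, here is a sketch of how the two custom components referenced below are defined (the index name my-index is just a placeholder; the values follow the min_gram/max_gram/preserve_original settings above):
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer_2_2": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "filter": {
        "ngram_filter_2_2": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "preserve_original": true
        }
      }
    }
  }
}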
- If I set ngram as the tokenizer, it includes whitespace in the grams:
Setting:
{
  "type": "custom",
  "tokenizer": "ngram_tokenizer_2_2"
}
Output: "我<WHITESPACE>" (undesired), "<WHITESPACE>爱" (undesired), "爱<WHITESPACE>" (undesired),"<WHITESPACE>青" (undesired),"青苹", and "苹果"
- If I set ngram as a token filter with the standard tokenizer, "青苹果" is split per character instead, because the standard tokenizer emits each ideograph as its own <IDEOGRAPHIC> token before the filter runs:
Setting:
{
  "type": "custom",
  "tokenizer": "standard",
  "filter": ["ngram_filter_2_2"]
}
Output: "我", "爱", "青" (undesired), "苹" (undesired), and "果" (undesired)
- If I set ngram as a token filter with the whitespace tokenizer, it returns the expected tokens for this input, but it fails on mixed-language input such as "I love青苹果", since whitespace leaves "love青苹果" as a single token (see the _analyze sketch after this list):
Setting:
{
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": ["ngram_filter_2_2"]
}
Output: "I", "lo", "ov", "ve", "e青" (undesired), "青苹", "苹果", and "love青苹果" (undesired)
Thank you in advance!