Hello, is it possible to apply an N-gram token filter to each <IDEOGRAPHIC> phrase the same way it already works on <HANGUL> tokens?
For example (with min_gram = max_gram = 2 and preserve_original enabled):
Input: "我 爱 青苹果"
Desired Output: "我", "爱", "青苹", "苹果", and "青苹果"
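For reference, here is a sketch of how the two custom components referenced below are defined (the index name my-index is just a placeholder; the values follow the min_gram/max_gram/preserve_original settings above):
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer_2_2": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "filter": {
        "ngram_filter_2_2": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "preserve_original": true
        }
      }
    }
  }
}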
- If I set ngram as the tokenizer, it includes whitespace in the grams:
Setting:
{
  "type": "custom",
  "tokenizer": "ngram_tokenizer_2_2"
}
Output: "我<WHITESPACE>" (undesired), "<WHITESPACE>爱" (undesired), "爱<WHITESPACE>" (undesired),"<WHITESPACE>青" (undesired),"青苹", and "苹果"
- If I set ngram as a token filter with the standard tokenizer, "青苹果" is split per character instead, because the standard tokenizer emits each ideograph as its own <IDEOGRAPHIC> token before the filter runs:
Setting:
{
  "type": "custom",
  "tokenizer": "standard",
  "filter": ["ngram_filter_2_2"]
}
Output: "我", "爱", "青" (undesired), "苹" (undesired), and "果" (undesired)
- If I set ngram as a token filter with the whitespace tokenizer, it returns the expected tokens for this input, but it fails on mixed-language input such as "I love青苹果", since whitespace leaves "love青苹果" as a single token (see the _analyze sketch after this list):
Setting:
{
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": ["ngram_filter_2_2"]
}
Output: "I", "lo", "ov", "ve", "e青" (undesired), "青苹", "苹果", and "love青苹果" (undesired)
Thank you in advance!