ES version: 7.17.15
I was recently testing the smartcn plugin, and the tokens it returned do not look relevant.
I used the config from the docs page "Reimplementing and extending the analyzers" (Elasticsearch Plugins and Integrations [8.13], Elastic) and tested the tokens with the /_analyze endpoint.
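For anyone who wants to reproduce this, here is a minimal Python sketch of that setup. It assumes the docs config is the custom analyzer built from `smartcn_tokenizer` with the `porter_stem` and `smartcn_stop` filters, as I recall it from that page. The node URL, the index name `smartcn_example`, and the helper names are assumptions for illustration only.

```python
import json
import urllib.request

# Assumptions for illustration: the node URL and index name are hypothetical.
ES_URL = "http://localhost:9200"
INDEX = "smartcn_example"


def index_settings():
    """Index settings that rebuild the smartcn analyzer as a custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "rebuilt_smartcn": {
                        "tokenizer": "smartcn_tokenizer",
                        "filter": ["porter_stem", "smartcn_stop"],
                    }
                }
            }
        }
    }


def analyze_body(text, analyzer="rebuilt_smartcn"):
    """Request body for POST /<index>/_analyze."""
    return {"analyzer": analyzer, "text": text}


def send(method, path, body):
    """Send a JSON request to the node and return the decoded JSON response."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs a running node with the analysis-smartcn plugin installed):
#   send("PUT", "/" + INDEX, index_settings())
#   result = send("POST", "/" + INDEX + "/_analyze", analyze_body(text))
#   print([t["token"] for t in result["tokens"]])
```

Nothing is sent until `send` is called, so the request bodies can be inspected first.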
When I pass the string 大卫贝克汉姆今天射入一粒精彩进球。 (which translates to "David Beckham scored a wonderful goal today."), it returns 14 tokens, 8 of which seem irrelevant.
Irrelevant tokens and their translations from Google Translate:
大 - big
卫 - guard
贝 - cowry
克 - gram
汉 - Chinese
姆 - Mu
入 - enter
粒 - grain
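All eight tokens listed above are single characters, so one quick way to spot them in a response is to split the tokens by length. This is a minimal sketch, assuming the usual /_analyze response shape (`{"tokens": [{"token": ...}, ...]}`). `split_by_length` is a hypothetical helper name, and the sample input contains only the eight tokens listed above, not the full 14-token response.

```python
def split_by_length(analyze_response):
    """Split the tokens of an /_analyze response into single-character and longer tokens."""
    tokens = [t["token"] for t in analyze_response.get("tokens", [])]
    single = [t for t in tokens if len(t) == 1]
    longer = [t for t in tokens if len(t) > 1]
    return single, longer


# Partial sample: only the eight tokens listed above. The remaining six
# tokens of the real response are not reproduced here.
listed = {"tokens": [{"token": ch} for ch in "大卫贝克汉姆入粒"]}
single, longer = split_by_length(listed)
print(len(single), len(longer))  # 8 0
```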
None of the tokens above relates to the text I passed. Am I missing something here, or is this how the plugin is expected to behave? The plugin itself doesn't have much configuration documented in the official docs.
Is there any way to improve this tokenizer?
PS: I don't know Chinese, and I rely entirely on Google Translate to validate the results.