Elasticsearch Smart Chinese plugin returns invalid tokens

HARI_RAM · June 5, 2024, 10:46am

ES version: 7.17.15

I was recently checking the smartcn plugin, and the tokens it returned do not look relevant.

I used the config from here: Reimplementing and extending the analyzers | Elasticsearch Plugins and Integrations [8.13] | Elastic
and used the /_analyze endpoint to test the tokens.
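For anyone who wants to reproduce this, here is a rough Python sketch of the two requests involved. It assumes the analyzer definition from the linked docs page (a custom analyzer named rebuilt_smartcn using smartcn_tokenizer with the porter_stem and smartcn_stop filters), a local unsecured node at http://localhost:9200, and a hypothetical index name smartcn_example. None of these values are confirmed as my exact setup, so adjust them as needed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without security enabled
INDEX = "smartcn_example"         # hypothetical index name, used only for this sketch


def index_settings():
    """Index settings that rebuild the smartcn analyzer as a custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "rebuilt_smartcn": {
                        "tokenizer": "smartcn_tokenizer",
                        "filter": ["porter_stem", "smartcn_stop"],
                    }
                }
            }
        }
    }


def analyze_body(text, analyzer="rebuilt_smartcn"):
    """Request body for POST /<index>/_analyze."""
    return {"analyzer": analyzer, "text": text}


def send(method, path, body):
    """Send a JSON request to the node and return the decoded JSON response."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def analyze(text):
    """Create the index, then return the token strings for `text`.

    Needs a running node with the analysis-smartcn plugin installed.
    The PUT raises an HTTPError if the index already exists.
    """
    send("PUT", "/" + INDEX, index_settings())
    result = send("POST", "/" + INDEX + "/_analyze", analyze_body(text))
    return [t["token"] for t in result["tokens"]]
```

Calling analyze("...") with the test string should return the list of token strings that the analyzer produces for it.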

When I pass this string, 大卫贝克汉姆今天射入一粒精彩进球。 (which translates to "David Beckham scored a wonderful goal today."), it returns 14 tokens, 8 of which seem irrelevant.

Irrelevant tokens and their translations from Google Translate:

大 - Big
卫 - guard
贝 - cowry
克 - gram
汉 - Chinese
姆 - Mu
入 - enter
粒 - grain
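A quick way to see where these tokens come from is to check them against the input string. The small Python sketch below uses only the input text and the eight tokens listed above; the helper names are made up for illustration. It shows that every listed token is a single character, and that the first six, joined in order, are exactly the opening run of the input, which appears to be the part that corresponds to the name in the translation. This is only an observation about the output, not a diagnosis of the plugin.

```python
TEXT = "大卫贝克汉姆今天射入一粒精彩进球。"

# The eight tokens reported as irrelevant, in the order listed in the post.
REPORTED = ["大", "卫", "贝", "克", "汉", "姆", "入", "粒"]


def single_char_tokens(tokens):
    """Return only the tokens that are one character long."""
    return [t for t in tokens if len(t) == 1]


def is_contiguous_run(tokens, text):
    """True if the tokens, joined in order, appear as one substring of text."""
    return "".join(tokens) in text


name_part = REPORTED[:6]

print(single_char_tokens(REPORTED) == REPORTED)  # all eight are single characters
print(is_contiguous_run(name_part, TEXT))        # first six form one run of the input
print(TEXT.startswith("".join(name_part)))       # and that run opens the sentence
```

In other words, the text was not replaced with unrelated words; a multi-character stretch of the input was emitted one character at a time, and Google Translate then translated each character in isolation.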

None of the tokens above relate to the text I passed. Am I missing something, or is this how the plugin is expected to behave? The plugin itself doesn't have much configuration documented in the official docs.

Is there any way to improve this tokenizer?

PS: I don't know Chinese, and I rely completely on Google Translate to validate the results.
