Hi,
I'm wondering if anybody has any insight into how the smartcn analyzer/tokenizer handles (or doesn't handle) Chinese punctuation?
Example: the string 负债表 tokenizes into a single word, but if you append a Chinese comma to the end, giving 负债表，, the tokenizer now splits it into two separate words: 负债 and 表. This seems incorrect; shouldn't the tokenizer first filter out any Chinese punctuation marks?
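As a possible workaround while waiting for an answer (this is not smartcn's own behavior, just a pre-filtering sketch): fullwidth CJK punctuation such as '，' (U+FF0C) and '。' (U+3002) falls under the Unicode punctuation categories (P*), so it can be stripped from the input before it reaches the analyzer. A minimal Python illustration of the idea:

```python
import unicodedata

def strip_punctuation(text):
    # Drop any character whose Unicode general category starts with "P"
    # (punctuation). This covers fullwidth CJK marks like '，' (U+FF0C)
    # and '。' (U+3002) as well as ASCII punctuation.
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("负债表，"))  # -> 负债表
```

With the comma removed up front, the analyzer would only ever see 负债表 and should tokenize it the same way as the bare string.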