Smartcn analyzer with Chinese punctuation


I'm wondering if anybody has any insight into how the smartcn analyzer/tokenizer handles (or doesn't handle) Chinese punctuation.


负债表 tokenizes into a single word, but if you append a Chinese comma to the end (负债表，), the tokenizer now splits this into two separate tokens: 负债 and 表. This seems incorrect; it feels like the tokenizer should first filter out any Chinese punctuation marks.
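For anyone who wants to reproduce this, here is a minimal sketch using the `_analyze` API, assuming the `analysis-smartcn` plugin is installed (the index and analyzer names below are hypothetical):

```json
POST _analyze
{
  "analyzer": "smartcn",
  "text": "负债表，"
}
```

One possible workaround, if the punctuation-sensitive behavior is confirmed, would be a custom analyzer that strips Chinese punctuation with a `mapping` char filter before the smartcn tokenizer runs (this assumes the plugin registers a tokenizer named `smartcn_tokenizer`; check your plugin version's docs):

```json
PUT smartcn_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_cn_punct": {
          "type": "mapping",
          "mappings": ["，=> ", "。=> ", "、=> "]
        }
      },
      "analyzer": {
        "smartcn_no_punct": {
          "char_filter": ["strip_cn_punct"],
          "tokenizer": "smartcn_tokenizer"
        }
      }
    }
  }
}
```

This doesn't answer why the tokenizer segments differently in the presence of punctuation, but it should make the output of 负债表 consistent with and without a trailing comma.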

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.