Hi,
I'm wondering if anybody has any insight into how the smartcn analyzer/tokenizer handles (or doesn't handle) Chinese punctuation?
Example: the string 负债表 tokenizes into a single word, but if you append a Chinese comma to the end, giving 负债表，, the tokenizer now splits it into two separate words: 负债 and 表. This seems incorrect; shouldn't the tokenizer first filter out any Chinese punctuation marks?
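As a possible workaround while waiting for an answer (this is not smartcn's own behavior, just a pre-filtering sketch): fullwidth CJK punctuation such as '，' (U+FF0C) and '。' (U+3002) falls under the Unicode punctuation categories (P*), so it can be stripped from the input before it reaches the analyzer. A minimal Python illustration of the idea:

```python
import unicodedata

def strip_punctuation(text):
    # Drop any character whose Unicode general category starts with "P"
    # (punctuation). This covers fullwidth CJK marks like '，' (U+FF0C)
    # and '。' (U+3002) as well as ASCII punctuation.
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("负债表，"))  # -> 负债表
```

With the comma removed up front, the analyzer would only ever see 负债表 and should tokenize it the same way as the bare string.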