I'm wondering if anybody has any insight on how the smartcn analyzer/tokenizer handles (or doesn't handle) Chinese punctuation?
负债表 tokenizes into a single word, but if you append a Chinese comma so the input becomes
负债表，, the tokenizer now splits it into two separate tokens, one of which is
表. This seems incorrect, and it feels like the tokenizer should first filter out any Chinese punctuation marks before segmenting.
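For context, here is the kind of pre-filtering I have in mind, sketched in Python (the function name is just for illustration; this isn't part of smartcn itself). It strips any character whose Unicode general category is punctuation, which covers fullwidth Chinese marks like ，、。 as well as ASCII ones:

```python
import unicodedata

def strip_punctuation(text: str) -> str:
    # Drop every character whose Unicode category starts with "P" (punctuation),
    # e.g. U+FF0C FULLWIDTH COMMA and U+3002 IDEOGRAPHIC FULL STOP.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

print(strip_punctuation("负债表，"))  # -> 负债表
```

If a step like this ran before segmentation, the trailing comma presumably could no longer influence how 负债表 gets split.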