Smartcn analyzer with Chinese punctuation


Does anybody have any insight into how the smartcn analyzer/tokenizer handles (or doesn't handle) Chinese punctuation?


负债表 tokenizes into a single word, but if you add a Chinese comma to the end (负债表，), the tokenizer now splits it into two separate words: 负债 and . This seems incorrect. Shouldn't the tokenizer filter out any Chinese punctuation marks first?
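To make the "filter out punctuation first" idea concrete, here is a minimal sketch of the kind of pre-filtering I have in mind. It is not part of smartcn and says nothing about how smartcn works internally; `strip_punctuation` is a hypothetical helper name, and I am assuming that "Chinese punctuation" can be approximated by the Unicode punctuation categories (which also cover ASCII punctuation). I have not verified that feeding the cleaned text to smartcn changes the result.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace every Unicode punctuation character with a space.

    Hypothetical pre-processing step, run before the text reaches the
    tokenizer. A space is used instead of deleting the character so that
    the words on either side of a comma are not glued together.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse the runs of whitespace left behind and trim both ends.
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(strip_punctuation("负债表，"))
    print(strip_punctuation("负债表，利润表。"))
```

The fullwidth comma (U+FF0C) and the ideographic full stop (U+3002) both fall under the `Po` category, so they are replaced here along with any other punctuation.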

