Trouble with SmartCN analyzer

Hi all,

curl -XGET 'http://localhost:9200/mm-chs/_analyze?analyzer=smartcn' -d 'curator post'
{"tokens":[{"token":"curat","start_offset":0,"end_offset":7,"type":"word","position":0},{"token":"post","start_offset":8,"end_offset":12,"type":"word","position":1}]}

Can anyone explain why the tokens are "curat" and "post"? Maybe it's because the word ends with "or", so ... condition

Thanks.

The smartcn analyzer is composed of HMMChineseTokenizer and PorterStemFilter.
See: https://github.com/apache/lucene-solr/blob/branch_5_5/lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.java#L144

After tokenizing, the Porter stem token filter stems each word, which is why "curator" becomes "curat".
If you don't need stemming, you should use smartcn_tokenizer instead.
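
As a rough illustration of the stemming step, here is a minimal sketch that traces "curator" through two Porter rules: step 2 rewrites "ator" to "ate" when the remaining stem has measure m > 0, and step 5a drops a final "e" when the stem has m > 1. This is not Lucene's PorterStemFilter. The helpers isConsonant, measure and stemSketch are hypothetical names made up for this example, and all the other Porter rules are left out.

```javascript
// Illustrative sketch only: two Porter rules, enough to trace "curator".
// NOT Lucene's PorterStemFilter; all function names here are hypothetical.

// Porter's consonant test: a, e, i, o, u are vowels; "y" is a consonant
// only at the start of the word or when it follows a vowel.
function isConsonant(w, i) {
  var c = w[i];
  if ("aeiou".indexOf(c) !== -1) return false;
  if (c === "y") return i === 0 ? true : !isConsonant(w, i - 1);
  return true;
}

// Porter's measure m: the number of vowel-run -> consonant-run transitions.
function measure(stem) {
  var m = 0;
  var prevVowel = false;
  for (var i = 0; i < stem.length; i++) {
    var cons = isConsonant(stem, i);
    if (cons && prevVowel) m++;
    prevVowel = !cons;
  }
  return m;
}

function stemSketch(word) {
  var w = word;
  // Step 2 rule: (m > 0) "ator" -> "ate", e.g. "curator" -> "curate"
  if (w.endsWith("ator") && measure(w.slice(0, -4)) > 0) {
    w = w.slice(0, -4) + "ate";
  }
  // Step 5a rule (m > 1 case only): drop a final "e", e.g. "curate" -> "curat"
  if (w.endsWith("e") && measure(w.slice(0, -1)) > 1) {
    w = w.slice(0, -1);
  }
  return w;
}

console.log(stemSketch("curator")); // prints "curat"
console.log(stemSketch("post"));    // prints "post"
```

So the "or" ending is only part of the story: the word is first rewritten to "curate", and then the trailing "e" is removed.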


@johtani thanks for your reply

Could you please explain HMMChineseTokenizer in more detail?
Example: "curator post" should be analyzed into 2 tokens, "curator" and "post".

Is this setting OK?
var indexSettings = {
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "smartcn",
          "tokenizer": "smartcn_tokenizer",
        }
      }
    }
  }
};

Thanks

See how to set up a custom analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html

You should change the type value to custom.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default" : {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer"
        }
      }
    }
  }
}

@johtani thanks a lot, let me try