Confused by the Smart Chinese Analysis plugin

I set smartcn as the default analyzer in the config file:

index.analysis.analyzer.default.type : "smartcn"
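
For reference, the same default can also be set per index at creation time. A minimal sketch (the index name is the one used in the queries below):

PUT /newmall
{
  "settings": {
    "index.analysis.analyzer.default.type": "smartcn"
  }
}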

And I created a type item which has a string field name, like this:

"name": "开关/插座 -代金券-西门子"

You can read it as A/B -C-D if you can't read Chinese.
I tried the analyze API:

GET /newmall/_analyze
{
  "text":"开关/插座 -代金券-西门子"
}

The text was successfully split into "A", "B", "C", "D", and all the stopwords and punctuation were dropped.
When I searched for "B", "C", and "D" separately, each returned the right doc.
However, when I searched for "A", which is "开关", I got zero hits.

GET /newmall/item/_search
{
  "query": {
    "match": {
      "name": "开关"
    }
  }
}

And the funny thing is that when I tried the term "A/", which is "开关/", it returned the right doc.

GET /newmall/item/_search
{
  "query": {
    "match": {
      "name": "开关/"
    }
  }
}

Hi,

I have to admit that I don't have the slightest clue about Chinese, so bear with me. :slight_smile:

Based on what I see, the analysis process does not seem to emit the token 开关 correctly. Could you please post the response of:

GET /newmall/_analyze
{
  "text":"开关/插座 -代金券-西门子"
}

Daniel

I've just tried this on Elasticsearch 2.3.0 and got the same issue. Here are the steps to reproduce:

1. Install the Smart Chinese Analysis plugin

sudo bin/plugin install analysis-smartcn

2. Create an index

curl -XPUT "http://localhost:9200/newmall" -d'
{
  "mappings": {
    "mall":{
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "smartcn"
        }
      }
    }
  }
}'

3. Test the analysis of "开关/插座 -代金券-西门子"

curl -XGET "http://localhost:9200/newmall/_analyze?analyzer=smartcn" -d'
{
  "text":"开关/插座 -代金券-西门子"
}'

yields the correct tokens

{
  "tokens": [
    {
      "token": "开关",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "插座",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "代金",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "券",
      "start_offset": 9,
      "end_offset": 10,
      "type": "word",
      "position": 5
    },
    {
      "token": "西门子",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 7
    }
  ]
}

4. Index a document with the text we just analyzed

curl -XPOST "http://localhost:9200/newmall/mall/1" -d'
{
  "text": "开关/插座 -代金券-西门子"
}'

5. Perform a match query on "开关"

curl -XGET "http://localhost:9200/newmall/mall/_search?explain" -d'
{
  "query": {
    "match": {
      "text": "开关"
    }
  }
}'

yields no results

6. But performing a match query on "开关/"

curl -XGET "http://localhost:9200/newmall/mall/_search?explain" -d'
{
  "query": {
    "match": {
      "text": "开关/"
    }
  }
}'

yields results (with explanation)

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.13424811,
    "hits": [
      {
        "_shard": 3,
        "_node": "F4_VMt-7Qpi11ACD3y4QYw",
        "_index": "newmall",
        "_type": "mall",
        "_id": "1",
        "_score": 0.13424811,
        "_source": {
          "text": "开关/插座 -代金券-西门子"
        },
        "_explanation": {
          "value": 0.13424811,
          "description": "sum of:",
          "details": [
            {
              "value": 0.13424811,
              "description": "weight(text:开关 in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.13424811,
                  "description": "fieldWeight in 0, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "tf(freq=1.0), with freq of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "termFreq=1.0",
                          "details": []
                        }
                      ]
                    },
                    {
                      "value": 0.30685282,
                      "description": "idf(docFreq=1, maxDocs=1)",
                      "details": []
                    },
                    {
                      "value": 0.4375,
                      "description": "fieldNorm(doc=0)",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 3.2588913,
                  "description": "_type:mall, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 3.2588913,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

@forloop has reproduced the issue above.

That's exactly the issue I ran into.
Thank you for reproducing it. My poor English may not have explained the issue clearly.

I tried two more things:

GET /newmall/_analyze?analyzer=smartcn
{
    "text": "开关"
}

This returns:

{
   "tokens": [
      {
         "token": "开",
         "start_offset": 0,
         "end_offset": 1,
         "type": "word",
         "position": 0
      },
      {
         "token": "关",
         "start_offset": 1,
         "end_offset": 2,
         "type": "word",
         "position": 1
      }
   ]
}

Whereas:

GET /newmall/_analyze?analyzer=smartcn
{
    "text": "开关/"
}

returns

{
   "tokens": [
      {
         "token": "开关",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      }
   ]
}

So it appears to me that the problem is in the search phase. For a Western language I'd say this is weird, but I don't know whether it makes sense to tokenize "开关" as two separate tokens or whether the characters always belong together.
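
One way to confirm what the match query is actually searching for is the validate API with explain, which prints the rewritten Lucene query. A sketch against the repro index above (untested):

GET /newmall/_validate/query?explain
{
  "query": {
    "match": {
      "text": "开关"
    }
  }
}

If the query string is analyzed into 开 and 关, the explanation should show something like text:开 text:关, and neither of those single-character terms exists in the index.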

As an alternative, you could check whether you get better results with the ICU analysis plugin.
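
For example (a sketch, assuming the 2.x plugin naming):

sudo bin/plugin install analysis-icu

GET /newmall/_analyze?analyzer=icu_analyzer
{
    "text": "开关/插座 -代金券-西门子"
}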

Daniel

Hi Daniel,
开 means "switch on", while 关 means "switch off". Both are verbs.
Combined together, 开关 is the noun "switch".

So it's a problem with the word segmentation algorithm?
When indexing, the smartcn analyzer cuts the text into ["开关", "插座", "代金券", "西门子"].
When searching, it cuts the query "开关" into ["开", "关"].

But why can't the terms (or tokens?) "开" and "关" match "开关"?
By analogy, could "turn" or "over" match "turnover"?
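
As a quick illustration of the analogy (standard analyzer, just to check my own reasoning):

GET /_analyze?analyzer=standard
{
    "text": "turnover"
}

This yields the single token turnover, so a match query for "turn" would not find it either, because matching happens on the exact terms in the inverted index.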

Can you try "开*"?

Can you try "开关*"?
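
Presumably that means a wildcard query, which is not analyzed and is matched against the indexed terms directly. A sketch against the repro index above:

GET /newmall/mall/_search
{
  "query": {
    "wildcard": {
      "text": "开关*"
    }
  }
}

Since the index contains the exact term 开关, the prefix pattern should match it.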

What you are seeing is the analyzer that processes the search query string: it splits "开关" into two parts. That is not necessarily the analyzer that was used for the "text" field. I think you can look up how to specify the analyzer you prefer for your query string; otherwise the default is used. Please check.
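
For example, the match query accepts a per-query analyzer option; the keyword analyzer would keep the query string as one token. A sketch (untested):

GET /newmall/mall/_search
{
  "query": {
    "match": {
      "text": {
        "query": "开关",
        "analyzer": "keyword"
      }
    }
  }
}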

Please pay attention to the first line of this topic. I have set smartcn as the default analyzer.

Hey @Morriaty

The issue is with the tokenizer. SmartCN tokenizes the whole sentence, using an HMM algorithm to compute how to segment the text, and context matters, so long text and short text may get different tokenization results. Maybe you can try this analyzer:

and try to use ik_max_word.
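
For example, once the plugin is installed, its segmentation can be compared with the analyze API (a sketch; ik_max_word is registered by the plugin itself):

GET /_analyze?analyzer=ik_max_word
{
    "text": "开关/插座 -代金券-西门子"
}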

Hi @medcl1, I suppose we could talk in Chinese. :joy:
In fact, I have already tried ik, and the result was the same as above. This is the template I am using now. Or are you saying I should only use ik_max_word in the template?

{
  "template": "newmall",
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik": {
          "type": "ik"
        },
        "ik_max_word": {
          "type": "ik",
          "use_smart": false
        },
        "ik_smart": {
          "type": "ik",
          "use_smart": true
        }
      }
    }
  }
}

PUT /newmall
{
  "mappings": {
    "mall":{
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ik"
        }
      }
    }
  }
}

POST /newmall/mall/1
{
  "text": "开关/插座 -代金券-西门子"
}

GET /newmall/mall/_search?explain
{
  "query": {
    "match": {
      "text": "开关"
    }
  }
}

You don't need to define a custom analyzer; they are ready to use directly. Try the example shown above.
If you are using a template, you must recreate the index.
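
For example (a sketch, assuming the template above has been registered under /_template):

DELETE /newmall

PUT /newmall

On creation, the newmall template's settings and analyzers are applied automatically.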

@medcl1: Thanks for chiming in! :slight_smile: