Confused with the Smart Chinese Analysis plugin

Morriaty · April 8, 2016, 8:52am

I set smartcn as the default analyzer in the config file:

index.analysis.analyzer.default.type : "smartcn"

I then created a type, item, with a string field, name, like this:

"name": "开关/插座 -代金券-西门子"

If you don't read Chinese, you can think of it as A/B - C-D.
I tried the analyze API:

GET /newmall/_analyze
{
  "text":"开关/插座 -代金券-西门子"
}

The text was successfully split into "A", "B", "C" and "D". All stopwords and punctuation were dropped.
When I searched for "B", "C" and "D" separately, each query returned the right doc.
However, when I searched for "A", which is "开关", I got zero hits.

GET /newmall/item/_search
{
  "query": {
    "match": {
      "name": "开关"
    }
  }
}

The funny thing is that when I searched for "A/", which is "开关/", it returned the right doc.

GET /newmall/item/_search
{
  "query": {
    "match": {
      "name": "开关/"
    }
  }
}

danielmitterdorfer · April 12, 2016, 8:14am

Hi,

I have to admit that I have not the slightest clue about Chinese so bear with me.

Based on what I see, the analysis process does not seem to emit the token 开关 correctly. Could you please post the response of:

GET /newmall/_analyze
{
  "text":"开关/插座 -代金券-西门子"
}

Daniel

forloop · April 12, 2016, 8:56am

I've just tried this on Elasticsearch 2.3.0 and got the same issue. Here are the steps to reproduce:

1. Install the Smart Chinese Analysis plugin

sudo bin/plugin install analysis-smartcn

2. Create an index

curl -XPUT "http://localhost:9200/newmall" -d'
{
  "mappings": {
    "mall":{
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "smartcn"
        }
      }
    }
  }
}'

3. Test the analysis of "开关/插座 -代金券-西门子"

curl -XGET "http://localhost:9200/newmall/_analyze?analyzer=smartcn" -d'
{
  "text":"开关/插座 -代金券-西门子"
}'

yields the correct tokens

{
  "tokens": [
    {
      "token": "开关",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "插座",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "代金",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "券",
      "start_offset": 9,
      "end_offset": 10,
      "type": "word",
      "position": 5
    },
    {
      "token": "西门子",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 7
    }
  ]
}

4. Index a document with the text we just analyzed

curl -XPOST "http://localhost:9200/newmall/mall/1" -d'
{
  "text": "开关/插座 -代金券-西门子"
}'

5. Perform a match query on "开关"

curl -XGET "http://localhost:9200/newmall/mall/_search?explain" -d'
{
  "query": {
    "match": {
      "text": "开关"
    }
  }
}'

yields no results

6. But performing a match query on "开关/"

curl -XGET "http://localhost:9200/newmall/mall/_search?explain" -d'
{
  "query": {
    "match": {
      "text": "开关/"
    }
  }
}'

yields results (with explanation)

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.13424811,
    "hits": [
      {
        "_shard": 3,
        "_node": "F4_VMt-7Qpi11ACD3y4QYw",
        "_index": "newmall",
        "_type": "mall",
        "_id": "1",
        "_score": 0.13424811,
        "_source": {
          "text": "开关/插座 -代金券-西门子"
        },
        "_explanation": {
          "value": 0.13424811,
          "description": "sum of:",
          "details": [
            {
              "value": 0.13424811,
              "description": "weight(text:开关 in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.13424811,
                  "description": "fieldWeight in 0, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "tf(freq=1.0), with freq of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "termFreq=1.0",
                          "details": []
                        }
                      ]
                    },
                    {
                      "value": 0.30685282,
                      "description": "idf(docFreq=1, maxDocs=1)",
                      "details": []
                    },
                    {
                      "value": 0.4375,
                      "description": "fieldNorm(doc=0)",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 3.2588913,
                  "description": "_type:mall, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 3.2588913,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Morriaty · April 12, 2016, 9:09am

@forloop has reproduced the issue below.

Morriaty · April 12, 2016, 9:14am

That's exactly the issue I ran into.
Thank you for reproducing it. My poor English may not have explained the issue clearly.

danielmitterdorfer · April 12, 2016, 10:57am

I tried two more things:

GET /newmall/_analyze?analyzer=smartcn
{
    "text": "开关"
}

This returns:

{
   "tokens": [
      {
         "token": "开",
         "start_offset": 0,
         "end_offset": 1,
         "type": "word",
         "position": 0
      },
      {
         "token": "关",
         "start_offset": 1,
         "end_offset": 2,
         "type": "word",
         "position": 1
      }
   ]
}

Whereas:

GET /newmall/_analyze?analyzer=smartcn
{
    "text": "开关/"
}

returns

{
   "tokens": [
      {
         "token": "开关",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      }
   ]
}

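So it appears to me that the problem is in the search phase: the query "开关" on its own is split into two tokens, while the indexed text contains the single token "开关". For a Western language I'd say this is weird, but I don't know whether it makes sense to tokenize "开关" as two separate tokens or whether the characters always belong together.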

As an alternative, you could check whether you get better results with the ICU Analysis plugin (analysis-icu).

Daniel

Morriaty · April 12, 2016, 12:26pm

Hi Daniel,
开 means "switch on", while 关 means "switch off". Both are verbs.
Combined, 开关 is the noun "switch".

So is this a problem with the word segmentation algorithm?
When indexing, the smartcn analyzer split the text into ["开关", "插座", "代金券", "西门子"].
When searching, the smartcn analyzer split the query "开关" into ["开", "关"].

But why can't the terms (or are they called tokens?) "开" and "关" match "开关"?
By comparison, could "turn" or "over" match "turnover"?
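
To make the mismatch concrete, here is a minimal Python sketch of exact term lookup in an inverted index. It is a toy model, not Elasticsearch or Lucene code: the helper names build_index and match_any are hypothetical, and the token lists are copied from the _analyze responses posted earlier in this thread. It assumes a match query behaves like an OR over exact term lookups on the analyzed query tokens.

```python
from collections import defaultdict

def build_index(docs):
    """Map each indexed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def match_any(index, query_tokens):
    """Return ids of documents containing at least one query token as an exact term."""
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return hits

# Tokens smartcn produced at index time for "开关/插座 -代金券-西门子" (from the thread).
index = build_index({1: ["开关", "插座", "代金", "券", "西门子"]})

# Tokens smartcn produced at search time (from the thread).
print(match_any(index, ["开", "关"]))  # query "开关": neither token is an indexed term
print(match_any(index, ["开关"]))      # query "开关/": the token is an indexed term
```

Terms are compared as whole strings, so "开" and "关" cannot match the indexed term "开关", just as "turn" or "over" would not match an indexed term "turnover" unless the analyzer also emitted those parts at index time.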

thn · April 12, 2016, 12:33pm

Can you try "开*"?

Can you try "开关*"?

What you are seeing is the analyzer applied to the search query string splitting "开关" into two parts, not the analyzer that was used or specified for the "text" field. I think you can look up a way to specify the analyzer you prefer for your query string; otherwise it will use the default. Please check.
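
As a sketch of what specifying the analyzer per query could look like, the Python snippet below builds a match query body that sets the analyzer parameter explicitly. The helper name match_with_analyzer is hypothetical, and the snippet only prints the JSON body; sending it to the cluster is left out. Note that if the named analyzer is the same one already applied by default, the query text will be segmented in the same way as before.

```python
import json

def match_with_analyzer(field, text, analyzer):
    """Build a match query body that names the search-time analyzer explicitly."""
    return {"query": {"match": {field: {"query": text, "analyzer": analyzer}}}}

# Field and analyzer names are taken from the reproduction steps in this thread.
body = match_with_analyzer("text", "开关", "smartcn")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed body can be used as the request body of GET /newmall/mall/_search.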

Morriaty · April 13, 2016, 1:19am

Please pay attention to the first line of this topic. I have set smartcn as the default analyzer.

medcl.net · April 13, 2016, 2:32am

Hey @Morriaty

The issue is with the tokenizer. SmartCN tokenizes at the sentence level and uses an HMM algorithm to compute how to segment the text, so context matters: a long text and a short text may be tokenized differently. Maybe you can try this analyzer:

and try to use ik_max_word.

Morriaty · April 13, 2016, 2:52am

Hi @medcl1, I suppose we could talk in Chinese.
In fact, I have already tried ik, and the result is the same as above. This is the template I am currently using. Or should I specify only ik_max_word in the template?

{
  "template": "newmall",
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik": {
          "type": "ik"
        },
        "ik_max_word": {
          "type": "ik",
          "use_smart": false
        },
        "ik_smart": {
          "type": "ik",
          "use_smart": true
        }
      }
    }
  }
}

medcl.net · April 13, 2016, 3:00am

PUT /newmall
{
  "mappings": {
    "mall":{
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ik"
        }
      }
    }
  }
}

POST /newmall/mall/1
{
  "text": "开关/插座 -代金券-西门子"
}

GET /newmall/mall/_search?explain
{
  "query": {
    "match": {
      "text": "开关"
    }
  }
}

You don't need to define custom analyzers; they are ready to use directly. Try the example shown above.
If you are using a template, you must recreate the index.
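
A rough sketch of what recreating the index involves, written in Python. It only builds and prints the sequence of requests; nothing is sent to a cluster. The helper name recreate_steps is hypothetical, and the index, type, field and analyzer names are the ones used in the example above. It assumes the documents to reindex are available from the original source, because deleting the index removes everything stored in it.

```python
import json

def recreate_steps(index, doc_type, field, analyzer, docs):
    """Return the (method, path, body) requests to rebuild an index with a new analyzer."""
    mapping = {"mappings": {doc_type: {"properties": {
        field: {"type": "string", "analyzer": analyzer}}}}}
    # Drop the old index, create it with the new mapping, then index the documents again.
    steps = [("DELETE", f"/{index}", None), ("PUT", f"/{index}", mapping)]
    for doc_id, source in docs.items():
        steps.append(("POST", f"/{index}/{doc_type}/{doc_id}", source))
    return steps

steps = recreate_steps("newmall", "mall", "text", "ik",
                       {1: {"text": "开关/插座 -代金券-西门子"}})
for method, path, body in steps:
    print(method, path)
    if body is not None:
        print(json.dumps(body, ensure_ascii=False))
```

The analyzer of an existing field cannot be changed in place, which is why the documents have to be indexed again after the new mapping is created.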

danielmitterdorfer · April 13, 2016, 6:48am

@medcl1: Thanks for chiming in!
