Smart Chinese Analysis returns Unicode code points instead of Chinese tokens


(Bogdan Petea) #1

I'm trying to analyze documents in Elasticsearch with the Smart Chinese Analyzer, but instead of the analyzed Chinese tokens, Elasticsearch returns the Unicode code points of the characters. For example:

PUT /test_chinese
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "smartcn"
          }
        }
      }
    }
  }
}

GET /test_chinese/_analyze?text='我说世界好!'

I expect to get each Chinese character back as a token, but instead I get:

{
  "tokens": [
    {
      "token": "25105",
      "start_offset": 3,
      "end_offset": 8,
      "type": "word",
      "position": 4
    },
    {
      "token": "35828",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 8
    },
    {
      "token": "19990",
      "start_offset": 19,
      "end_offset": 24,
      "type": "word",
      "position": 12
    },
    {
      "token": "30028",
      "start_offset": 27,
      "end_offset": 32,
      "type": "word",
      "position": 16
    },
    {
      "token": "22909",
      "start_offset": 35,
      "end_offset": 40,
      "type": "word",
      "position": 20
    }
  ]
}
Do you have any idea what's going on?

Thank you!


(Zachary Tong) #2

Hmm, are you executing this command through a browser-based client (Sense, Postman, etc.)? I just tried it in Sense and see the same thing. But when I look at the network traffic, it looks like Sense is HTML-entity-encoding the characters before URL-encoding the request (admittedly, Unicode is not my forte).

Here is what is actually being sent to ES:

http://127.0.0.1:9200/test_chinese/_analyze?text=%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27

Which URL-decodes to:

'&#25105;&#35828;&#19990;&#30028;&#22909;!'

i.e. HTML numeric character references for '我说世界好!'. That lines up exactly with the tokens you got: 25105 is the code point of 我, 35828 of 说, and so on. The analyzer strips the &, #, and ; as punctuation and keeps the bare numbers.
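
Here is a minimal sketch in Python (not Sense's actual code, just a reproduction of the apparent behavior) showing that entity-encoding the characters and then percent-encoding the result produces exactly the URL captured above:

from urllib.parse import quote

text = "'我说世界好!'"

# Replace each non-ASCII character with an HTML numeric entity,
# which is what the old Sense appears to be doing...
entities = "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)
print(entities)
# '&#25105;&#35828;&#19990;&#30028;&#22909;!'

# ...then percent-encode the entity string for the query string.
print(quote(entities, safe="!"))
# %27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27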

If I run the same thing on the command line, I get back proper Chinese characters:

$ curl -XGET -v "http://127.0.0.1:9200/test_chinese/_analyze?text='我说世界好!'&pretty"

* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9200 (#0)
> GET /test_chinese/_analyze?text='我说世界好!'&pretty HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1749
< 
{
  "tokens" : [ {
    "token" : "₩",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ネ",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "ム",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "│",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "ᆵ",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "ᄡ",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "¦",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "ᄌ",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "ヨ",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "￧",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "ユ",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "フ",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "¥",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "ᆬ",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "ᄑ",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "word",
    "position" : 15
  } ]
}

So it seems to be an encoding problem on the client side. Are you using Sense? If so, which version (the old one bundled with Marvel, or the newer Kibana app in 2.0)?
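
In the meantime, one way to sidestep the client's URL handling entirely is to send the text in the request body rather than the query string. A sketch, assuming the 1.x-style _analyze API (which also accepts the raw request body as the text to analyze; adjust for your version):

import json
import urllib.request

# Ship the text as raw UTF-8 bytes in the body, so no URL or entity
# encoding on the client side can corrupt it.
req = urllib.request.Request(
    "http://127.0.0.1:9200/test_chinese/_analyze",
    data="我说世界好!".encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), ensure_ascii=False, indent=2))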


(Bogdan Petea) #3

I just discovered the same problem.

Yes, I'm using Sense from Marvel 1.3.1. When I index through Sense in Google Chrome, I get the Unicode code points of the Chinese characters. I'll investigate as well and come back with an answer.


(Zachary Tong) #4

I just tested in Sense 2.0.0-beta1 (the Kibana app version) and the results are similar:

Sent over the wire:

http://localhost:5601/api/sense/proxy?uri=http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27&_=1450180681591

Result:

{
  "tokens": [
    {
      "token": "\u0011",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "�",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "\u0016",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "l",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    }
  ]
}
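
Note that this time the captured URL does contain the text correctly percent-encoded as UTF-8 (%E6%88%91 is 我, %E8%AF%B4 is 说, and so on), so the characters left the browser intact and the corruption has to happen between the Sense proxy and Elasticsearch. A quick way to verify the decoding:

from urllib.parse import unquote

# The percent-encoded inner URI from the proxy request above.
captured = ("http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze"
            "%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27")
print(unquote(captured))
# http://localhost:9200/test_chinese/_analyze?text='我说世界好!'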

I'll open a ticket; this seems like a bug in Sense.


(Zachary Tong) #5

Ticket opened: https://github.com/elastic/sense/issues/88


(Bogdan Petea) #6

Thank you!

