Smart Chinese Analysis returns Unicode code points instead of Chinese tokens

I'm trying to analyse documents in Elasticsearch using the Smart Chinese Analyser, but instead of getting the analysed Chinese characters, Elasticsearch returns the Unicode code points of those characters. For example:

PUT /test_chinese
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "smartcn"
          }
        }
      }
    }
  }
}

GET /test_chinese/_analyze?text='我说世界好!'
I expect to get every Chinese character back, but instead I get:

{
  "tokens": [
    {
      "token": "25105",
      "start_offset": 3,
      "end_offset": 8,
      "type": "word",
      "position": 4
    },
    {
      "token": "35828",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 8
    },
    {
      "token": "19990",
      "start_offset": 19,
      "end_offset": 24,
      "type": "word",
      "position": 12
    },
    {
      "token": "30028",
      "start_offset": 27,
      "end_offset": 32,
      "type": "word",
      "position": 16
    },
    {
      "token": "22909",
      "start_offset": 35,
      "end_offset": 40,
      "type": "word",
      "position": 20
    }
  ]
}
Do you have any idea what's going on?

Thank you!

Hmm, are you executing this command through a browser (Sense, Postman, etc.)? I just tried it in Sense and see the same thing. But when I look at the network traffic, it looks like Sense is URL-encoding the characters or something (admittedly, Unicode is not my forte).

Here is what is actually being sent to ES:

http://127.0.0.1:9200/test_chinese/_analyze?text=%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27

Which urldecodes to:

'&#25105;&#35828;&#19990;&#30028;&#22909;!'

Which lines up with the tokens you see: numerics with the special characters stripped.
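
You can reproduce the decoding yourself; this is just Python's standard urllib as a sanity check, nothing Elasticsearch-specific:

$ python3 -c "from urllib.parse import unquote; print(unquote('%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27'))"
'&#25105;&#35828;&#19990;&#30028;&#22909;!'

Those &#NNNNN; sequences are HTML numeric character references: 25105 is 我, 35828 is 说, 19990 is 世, 30028 is 界, and 22909 is 好. Once the &, # and ; are dropped during tokenization, only the digits remain, which is exactly what your tokens show.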

If I run the same thing on the command line, I get back proper Chinese characters:

$ curl -XGET -v "http://127.0.0.1:9200/test_chinese/_analyze?text='我说世界好!'&pretty"

* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9200 (#0)
> GET /test_chinese/_analyze?text='我说世界好!'&pretty HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1749
< 
{
  "tokens" : [ {
    "token" : "₩",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ネ",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "ム",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "│",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "ᆵ",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "ᄡ",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "¦",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "ᄌ",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "ヨ",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "￧",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "ユ",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "フ",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "¥",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "ᆬ",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "ᄑ",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "word",
    "position" : 15
  } ]
}

So it seems to be an encoding problem on the client side. Are you using Sense? If so, which version (the old one in Marvel, or the newer one that ships as a Kibana app in 2.0)?
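
In the meantime, one workaround from the command line is to let curl do the percent-encoding of the raw UTF-8 bytes for you. This is just a sketch against the same test_chinese index as above; with -G, each --data-urlencode value is appended to the URL as a query-string parameter:

$ curl -G "http://127.0.0.1:9200/test_chinese/_analyze" --data-urlencode "text=我说世界好" --data-urlencode "pretty=true"

That way the characters reach Elasticsearch as percent-encoded UTF-8 rather than as HTML entities.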

I just discovered the same problem.

Yes, I use the Sense from Marvel 1.3.1. When I try indexing with Sense in Google Chrome, I get the Unicode code points of the Chinese characters. I will investigate as well and come back with an answer.

I just tested in Sense 2.0.0-beta1 (the Kibana app version) and the results are similar:

Sent over the wire:

http://localhost:5601/api/sense/proxy?uri=http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27&_=1450180681591

Result:

{
  "tokens": [
    {
      "token": "\u0011",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "�",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "\u0016",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "l",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    }
  ]
}
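
For what it's worth, if I urldecode the proxied URI above, the text does come through as proper percent-encoded UTF-8 this time (again, just Python's unquote as a sanity check):

$ python3 -c "from urllib.parse import unquote; print(unquote('http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27'))"
http://localhost:9200/test_chinese/_analyze?text='我说世界好!'

So the request itself looks fine; the mangling presumably happens elsewhere in Sense's proxy or response handling.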

I'll open a ticket; this looks like a bug in Sense.

Ticket opened: https://github.com/elastic/sense/issues/88

Thank you!