Smart Chinese Analysis returns Unicode code points instead of Chinese tokens


(Bogdan Petea) #1

I'm trying to analyze documents in Elasticsearch with the Smart Chinese Analyzer, but instead of the analyzed Chinese tokens, Elasticsearch returns the Unicode code points of the characters. For example:

PUT /test_chinese
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "smartcn"
          }
        }
      }
    }
  }
}

GET /test_chinese/_analyze?text='我说世界好!'

I expect to get each Chinese character back as a token, but instead I get:

{
  "tokens": [
    {
      "token": "25105",
      "start_offset": 3,
      "end_offset": 8,
      "type": "word",
      "position": 4
    },
    {
      "token": "35828",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 8
    },
    {
      "token": "19990",
      "start_offset": 19,
      "end_offset": 24,
      "type": "word",
      "position": 12
    },
    {
      "token": "30028",
      "start_offset": 27,
      "end_offset": 32,
      "type": "word",
      "position": 16
    },
    {
      "token": "22909",
      "start_offset": 35,
      "end_offset": 40,
      "type": "word",
      "position": 20
    }
  ]
}
Do you have any idea what's going on?

Thank you!


(Zachary Tong) #2

Hmm, are you executing this command through a browser-based client (Sense, Postman, etc.)? I just tried it in Sense and see the same thing. But when I look at the network traffic, it looks like Sense is HTML-entity-encoding the characters before URL-encoding the request (admittedly, Unicode is not my forte).

Here is what is actually being sent to ES:

http://127.0.0.1:9200/test_chinese/_analyze?text=%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27

Which URL-decodes to:

'&#25105;&#35828;&#19990;&#30028;&#22909;!'

i.e. HTML numeric character references for '我说世界好!'. That lines up exactly with the tokens you got: 25105 is the code point of 我, 35828 of 说, and so on. The analyzer strips the &, #, and ; as punctuation and keeps the bare numbers.
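
Here is a minimal sketch in Python (not Sense's actual code, just a reproduction of the apparent behavior) showing that entity-encoding the characters and then percent-encoding the result produces exactly the URL captured above:

from urllib.parse import quote

text = "'我说世界好!'"

# Replace each non-ASCII character with an HTML numeric entity,
# which is what the old Sense appears to be doing...
entities = "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)
print(entities)
# '&#25105;&#35828;&#19990;&#30028;&#22909;!'

# ...then percent-encode the entity string for the query string.
print(quote(entities, safe="!"))
# %27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27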

If I run the same thing on the command line, I get back proper Chinese characters:

$ curl -XGET -v "http://127.0.0.1:9200/test_chinese/_analyze?text='我说世界好!'&pretty"

* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9200 (#0)
> GET /test_chinese/_analyze?text='我说世界好!'&pretty HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1749
< 
{
  "tokens" : [ {
    "token" : "₩",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ネ",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "ム",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "│",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "ᆵ",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "ᄡ",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "¦",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "ᄌ",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "ヨ",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "￧",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "ユ",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "フ",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "¥",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "ᆬ",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "ᄑ",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "word",
    "position" : 15
  } ]
}

So it seems to be an encoding problem on the client side. Are you using Sense? If so, which version (the old one bundled with Marvel, or the newer Kibana app in 2.0)?
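
In the meantime, one way to sidestep the client's URL handling entirely is to send the text in the request body rather than the query string. A sketch, assuming the 1.x-style _analyze API (which also accepts the raw request body as the text to analyze; adjust for your version):

import json
import urllib.request

# Ship the text as raw UTF-8 bytes in the body, so no URL or entity
# encoding on the client side can corrupt it.
req = urllib.request.Request(
    "http://127.0.0.1:9200/test_chinese/_analyze",
    data="我说世界好!".encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), ensure_ascii=False, indent=2))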


(Bogdan Petea) #3

I just discovered the same problem.

Yes, I'm using Sense from Marvel 1.3.1. When I index through Sense in Google Chrome, I get the Unicode code points of the Chinese characters. I'll investigate as well and come back with an answer.


(Zachary Tong) #4

I just tested in Sense 2.0.0-beta1 (the Kibana app version) and the results are similar:

Sent over the wire:

http://localhost:5601/api/sense/proxy?uri=http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27&_=1450180681591

Result:

{
  "tokens": [
    {
      "token": "\u0011",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "�",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "\u0016",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "l",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    }
  ]
}
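
Note that this time the captured URL does contain the text correctly percent-encoded as UTF-8 (%E6%88%91 is 我, %E8%AF%B4 is 说, and so on), so the characters left the browser intact and the corruption has to happen between the Sense proxy and Elasticsearch. A quick way to verify the decoding:

from urllib.parse import unquote

# The percent-encoded inner URI from the proxy request above.
captured = ("http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze"
            "%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27")
print(unquote(captured))
# http://localhost:9200/test_chinese/_analyze?text='我说世界好!'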

I'll open a ticket; this seems like a bug in Sense.


(Zachary Tong) #5

Ticket opened: https://github.com/elastic/sense/issues/88


(Bogdan Petea) #6

Thank you!

