Hmm, are you executing this command through a browser-based tool (Sense, Postman, etc.)? I just tried it in Sense and see the same thing. When I look at the network traffic, it looks like Sense is escaping the characters before URL-encoding them (admittedly, Unicode is not my forte).
Here is what is actually being sent to ES:
http://127.0.0.1:9200/test_chinese/_analyze?text=%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27
Which URL-decodes to:
'&#25105;&#35828;&#19990;&#30028;&#22909;!'
That is, HTML numeric character references rather than the raw characters — which lines up with the tokens you see: the numeric values with the special characters stripped.
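As a sanity check, decoding that query string (a quick Python sketch; the string is copied verbatim from the capture above) shows what the server actually received:

```python
from urllib.parse import unquote

# Query string exactly as captured in the network traffic above
encoded = "%27%26%2325105%3B%26%2335828%3B%26%2319990%3B%26%2330028%3B%26%2322909%3B!%27"

decoded = unquote(encoded)
print(decoded)  # → '&#25105;&#35828;&#19990;&#30028;&#22909;!'
```

So the analyzer never sees the Chinese characters at all — it sees entity strings, strips the `&`, `#`, and `;` as punctuation, and emits the digit runs as tokens.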
If I run the same thing on the command line, I get back proper Chinese characters:
$ curl -XGET -v "http://127.0.0.1:9200/test_chinese/_analyze?text='我说世界好!'&pretty"
* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9200 (#0)
> GET /test_chinese/_analyze?text='我说世界好!'&pretty HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1749
<
{
  "tokens" : [ {
    "token" : "₩",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ネ",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "ム",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "│",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "ᆵ",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "ᄡ",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "¦",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "ᄌ",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "ヨ",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "ユ",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "フ",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "¥",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "ᆬ",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "ᄑ",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "word",
    "position" : 15
  } ]
}
So it seems to be an encoding problem on the client side. Are you using Sense? If so, which version — the old one bundled with Marvel, or the newer one that ships as a Kibana app in 2.0?
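For comparison, here is what proper UTF-8 percent-encoding of that text looks like (a Python sketch, just to illustrate the difference from what Sense sent):

```python
from urllib.parse import quote

# UTF-8 percent-encoding of the Chinese text — note the %Ex byte sequences,
# quite different from the &#NNNNN; entity form seen in the Sense capture
print(quote("我说世界好"))
# → %E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD
```

A client that sends the text in this form should produce the expected per-character tokens rather than numerics.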