Getting characters to be stored correctly


(Tonni Hult) #1

Hi

I am having a problem getting, for example, Chinese characters stored correctly in Elasticsearch (we use version 2.3.2).

In my index I have a mapping for the field Username that is set like this:

"Username": {
  "type": "string",
  "index": "not_analyzed"
}

We don't need to analyze the field, but we must be able to handle usernames in Russian, Chinese, Arabic, etc. Testing the mapping to see what token it produces seems to give the right result when calling http://localhost:9200/logstash-dev-2016.09.19/_analyze?field=Username&text=灵铃理立

{
  "tokens": [
    {
      "token": "灵铃理立",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

But when I insert a document into ES and then search for it in Kibana (4.5.4), I get a result like �铃�立.
What am I doing wrong?

Thank you in advance


(Adrien Grand) #2

Can you try to force your browser to use UTF-8 as the encoding? Does that fix the issue?


(Tonni Hult) #3

I tested with Chrome and that setting is not available for me; the encoding menu is greyed out. But I don't think that would solve the problem, because when I run the same search in Postman I get the same result.


(Adrien Grand) #4

Well, something is messing with the encoding. I don't think the issue is in Elasticsearch itself, as many users work with Chinese or Russian content successfully.

When you call the search API the same way you call the analyze API (since the latter seems to work for you), do you still see encoding issues? If so, the problem might be at index time, caused by a client that uses a different encoding than the one it declares.


(Tonni Hult) #5

Well, you're right, it doesn't seem to be ES: I've found that the username is wrong even before it is inserted into ES. We use nxlog as our shipper, and from there the username looks OK, but in the output from Logstash it is already wrong. In between are Redis and a load balancer, so I need to find where the problem starts.
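For anyone hitting the same symptom, here is a minimal sketch of what that kind of mismatch does. It assumes the shipper sends UTF-8 bytes while the pipeline is told to decode them as CP1252 (which turned out to be the case later in this thread); the username is the one from the original post.

```python
# Sketch: UTF-8 bytes mis-decoded with a wrongly declared charset.
text = "灵铃理立"            # the username from the original post
raw = text.encode("utf-8")   # the bytes the shipper actually sends

# Several of these UTF-8 bytes (e.g. 0x81, 0x90) are undefined in CP1252,
# so decoding them as CP1252 yields the U+FFFD replacement character (�).
garbled = raw.decode("cp1252", errors="replace")
print(garbled)               # mojibake containing �, never the original text

# Decoding with the correct charset round-trips cleanly.
assert raw.decode("utf-8") == text
```

This reproduces the � characters seen in Kibana: the corruption happens when the bytes are first decoded, so by the time the data reaches ES it is already wrong.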


(Tonni Hult) #6

Finally found the problem! In my input configuration in Logstash I had the following setting:

{charset => "CP1252"}

Removing it made the codec default back to UTF-8, and everything works fine now! Thanks for pointing me in the right direction.
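For reference, a minimal sketch of what the corrected Logstash input could look like. The redis input and its parameters here are assumptions for illustration, not the poster's actual config; the point is that the plain codec's charset defaults to UTF-8, so it can simply be left out (or stated explicitly):

```
input {
  redis {
    host      => "localhost"
    data_type => "list"
    key       => "logstash"
    # The plain codec defaults to charset => "UTF-8",
    # which matches what nxlog ships.
    codec     => plain { charset => "UTF-8" }
  }
}
```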

