Getting characters to be stored correctly


(Tonni Hult) #1

Hi

I am having a problem getting, for example, Chinese characters stored correctly in Elasticsearch (we use version 2.3.2).

In my index I have a mapping for the field Username that is set like this:

"Username": {
  "type": "string",
  "index": "not_analyzed"
}

We don't need to analyze the field, but we must be able to handle usernames in Russian, Chinese, Arabic, etc. Testing the mapping to see what token it produces seems to give the right result when calling http://localhost:9200/logstash-dev-2016.09.19/_analyze?field=Username&text=灵铃理立

{
  "tokens": [
    {
      "token": "灵铃理立",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

But when I insert a document into ES and then search for it in Kibana (4.5.4), I get a result like �铃�立.
What am I doing wrong?

Thank you in advance


(Adrien Grand) #2

Can you try to force your browser to use UTF-8 as the encoding? Does that fix the issue?


(Tonni Hult) #3

I tested with Chrome and that setting is not available for me; the encoding menu is greyed out. But I don't think that would solve the problem, because when I run the same search in Postman I get the same result.


(Adrien Grand) #4

Well, something is messing with the encoding. I don't think the issue is in Elasticsearch itself, as many users work with Chinese or Russian content successfully.

When you call the search API the same way you call the analyze API (since the latter seems to work for you), do you still see encoding issues? If so, the problem might be at index time, caused by a client that uses a different encoding than the one it declares.


(Tonni Hult) #5

Well, you're right, it doesn't seem to be ES: I've found that the username is wrong even before it is inserted into ES. We use nxlog as our shipper, and from there the username looks OK, but in the output from Logstash it is already wrong. In between are Redis and a load balancer, so I need to find where the problem starts.
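For anyone hitting the same symptom, here is a minimal sketch of what that kind of mismatch does. It assumes the shipper sends UTF-8 bytes while the pipeline is told to decode them as CP1252 (which turned out to be the case later in this thread); the username is the one from the original post.

```python
# Sketch: UTF-8 bytes mis-decoded with a wrongly declared charset.
text = "灵铃理立"            # the username from the original post
raw = text.encode("utf-8")   # the bytes the shipper actually sends

# Several of these UTF-8 bytes (e.g. 0x81, 0x90) are undefined in CP1252,
# so decoding them as CP1252 yields the U+FFFD replacement character (�).
garbled = raw.decode("cp1252", errors="replace")
print(garbled)               # mojibake containing �, never the original text

# Decoding with the correct charset round-trips cleanly.
assert raw.decode("utf-8") == text
```

This reproduces the � characters seen in Kibana: the corruption happens when the bytes are first decoded, so by the time the data reaches ES it is already wrong.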


(Tonni Hult) #6

Finally found the problem! In my input configuration in Logstash I had the following setting:

{charset => "CP1252"}

Removing it made the codec default back to UTF-8, and everything works fine now! Thanks for pointing me in the right direction.
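For reference, a minimal sketch of what the corrected Logstash input could look like. The redis input and its parameters here are assumptions for illustration, not the poster's actual config; the point is that the plain codec's charset defaults to UTF-8, so it can simply be left out (or stated explicitly):

```
input {
  redis {
    host      => "localhost"
    data_type => "list"
    key       => "logstash"
    # The plain codec defaults to charset => "UTF-8",
    # which matches what nxlog ships.
    codec     => plain { charset => "UTF-8" }
  }
}
```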

