Convert existing character encoding in ElasticSearch

The way ElasticSearch received JSON from Java was amended from ISO-8859-1 to UTF-8, because we were getting invalid JSON when symbols such as the copyright sign appeared at the start of a field.

However, I now notice that since this change, characters such as é are being stored in ElasticSearch as Ã©.

This is because we were still amending the encoding of the string prior to ElasticSearch insert as follows:

    byte[] bytes = newString.getBytes("UTF-8");
    newString = new String(bytes, "ISO-8859-1");

When I take out this re-encoding, new records once again insert correctly into ElasticSearch.
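For reference, the corruption can be reproduced in isolation. This is a minimal standalone sketch (not our actual indexing code) showing how the UTF-8/ISO-8859-1 round trip mangles a single character:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Reproduce the accidental re-encoding: encode as UTF-8,
    // then decode those bytes as ISO-8859-1
    static String mangle(String s) {
        // "é" becomes the two bytes 0xC3 0xA9 in UTF-8...
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        // ...and ISO-8859-1 turns each byte into its own character
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(mangle("é")); // prints "Ã©"
    }
}
```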

However, we have tens of thousands of records that were inserted with words such as "naÃ¯ve" instead of "naïve", etc. - I was wondering if there is any way of converting these back to their UTF-8 equivalents?
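As far as I understand it, the damage is reversible as long as each mojibake sequence survived intact: encoding the corrupted string back to ISO-8859-1 bytes recovers the original UTF-8 bytes, which can then be decoded correctly. A rough standalone sketch (hypothetical helper, not tied to any particular client library):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Reverse the accidental UTF-8 -> ISO-8859-1 round trip:
    // the Latin-1 code points map one-to-one back to the original bytes
    static String repair(String mojibake) {
        byte[] originalUtf8 = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalUtf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(repair("naÃ¯ve")); // prints "naïve"
    }
}
```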

Taking the character conversion from - I have tried Logstash reading from one index and writing to a new one, with a filter as follows:
    filter {
      mutate {
        gsub => [
          # replace UTF-8-as-Latin-1 mojibake sequences
          # with the intended characters
          "ARTICLE_TITLE", "Ã¯", "ï",
          "ARTICLE_TITLE", "Ã©", "é",
          "ARTICLE_TITLE", "Ãº", "ú",
          "ARTICLE_TITLE", "Å¯", "ů",
          "ARTICLE_TITLE", "Ã­", "í",
          "ARTICLE_TITLE", "Ã¡", "á",
          "ARTICLE_TITLE", "Å™", "ř",
          "ARTICLE_TITLE", "Ã—", "×",
          "ARTICLE_TITLE", "Ã¦", "æ",
          "ARTICLE_TITLE", "Ã³", "ó"
        ]
      }
    }

But there are many, many more such characters, and a good chance some would be missed with a one-to-one character mapping. I was wondering if there is a more efficient way of converting all the characters that were interpreted as Windows-1252 (or ISO-8859-1) bytes back to UTF-8, either in Logstash or by running an ElasticSearch update?
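One caveat worth flagging: some of the trailing bytes (e.g. 0x99 in the UTF-8 encoding of ř) are unprintable control codes in ISO-8859-1 but printable characters in Windows-1252, which is why the corrupted text shows Å™ rather than an invisible character. Decoding with windows-1252 instead of ISO-8859-1 therefore reverses more cases. A sketch, assuming the repair is done in Java rather than inside Logstash:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Windows1252Repair {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Re-encode using Windows-1252, which maps 0x80-0x9F to printable
    // characters (e.g. ™ -> 0x99), unlike ISO-8859-1 where they are controls
    static String repair(String mojibake) {
        byte[] originalUtf8 = mojibake.getBytes(WINDOWS_1252);
        return new String(originalUtf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(repair("Å™")); // prints "ř"
    }
}
```

The same idea could presumably be expressed in a Logstash ruby filter (re-encoding the field's bytes), which would avoid maintaining the per-character gsub list entirely.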

Many thanks,
