Convert existing character encoding in ElasticSearch

The way ElasticSearch received JSON from Java was amended from ISO-8859-1 to UTF-8 because we were getting invalid JSON when symbols such as the copyright sign were at the start of a field.

However, I now notice that since this change, characters such as é are being stored in ElasticSearch as Ã©.

This is because we were still converting the encoding of the string prior to the ElasticSearch insert, as follows:

    byte[] bytes = newString.getBytes("UTF-8");
    newString = new String(bytes, "ISO-8859-1");

When I take out this string conversion, new data once again inserts correctly into ElasticSearch.

However, we have tens of thousands of records which have been inserted with words such as "naÃ¯ve" instead of "naïve", etc. Is there any way of converting these back to their UTF-8 equivalents?
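
As far as I can tell, the damage should be reversible per string by running the conversion above the other way round: encode the garbled text back to Windows-1252 (or ISO-8859-1) bytes and decode those bytes as UTF-8. A rough, untested sketch of what I mean in Java (the class and method names are just for illustration):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeFix {

        // Reverse of the accidental conversion above, e.g. "naÃ¯ve" -> "naïve".
        // Assumes the stored text is UTF-8 that was mis-read as Windows-1252;
        // swap in ISO-8859-1 if that is what the data was actually decoded as.
        static String fixMojibake(String garbled) {
            byte[] originalUtf8 = garbled.getBytes(Charset.forName("windows-1252"));
            return new String(originalUtf8, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            System.out.println(fixMojibake("naÃ¯ve")); // prints "naïve"
        }
    }

The catch is that any value which was stored correctly and contains characters outside Windows-1252 would be mangled by blindly applying this, so it would need some kind of guard when run in bulk.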

Taking the character conversions from https://www.i18nqa.com/debug/utf8-debug.html, I have tried using Logstash to read from one index and write to a new one with a filter as follows:
filter {
  mutate {
    gsub => [
      # replace mis-decoded Windows-1252 sequences
      # with the intended UTF-8 characters
      "ARTICLE_TITLE", "Ã¯", "ï",
      "ARTICLE_TITLE", "Ã©", "é",
      "ARTICLE_TITLE", "Ãº", "ú",
      "ARTICLE_TITLE", "Å¯", "ů",
      "ARTICLE_TITLE", "Ã­", "í",
      "ARTICLE_TITLE", "Ã¡", "á",
      "ARTICLE_TITLE", "Å™", "ř",
      "ARTICLE_TITLE", "Ã—", "×",
      "ARTICLE_TITLE", "Ã¦", "æ",
      "ARTICLE_TITLE", "Ã³", "ó"
    ]
  }
}

But there are many, many more such characters, and a good chance some would be missed using a one-to-one character mapping. Is there a more efficient way of converting all the characters that were interpreted as Windows-1252 (or ISO-8859-1) bytes back to UTF-8, either in Logstash or by running an ElasticSearch update?
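
For example, would a Logstash ruby filter along these lines be a reasonable approach? This is an untested sketch: it assumes the bytes were mis-read as Windows-1252 and leaves a value untouched if it cannot be converted (the elasticsearch input/output around it is omitted).

filter {
  ruby {
    code => '
      title = event.get("ARTICLE_TITLE")
      if title.is_a?(String)
        begin
          # re-encode the garbled text as Windows-1252 bytes and
          # re-interpret those bytes as UTF-8
          fixed = title.encode("Windows-1252").force_encoding("UTF-8")
          event.set("ARTICLE_TITLE", fixed) if fixed.valid_encoding?
        rescue EncodingError
          # value was presumably not mojibake (it contains characters
          # outside Windows-1252), so leave it as it is
        end
      end
    '
  }
}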

Many thanks,
Mark
