I was trying to index data into Elasticsearch using Logstash and got the following error message:
Received an event that has a different character encoding than you configured. {:text=>"H\\tC\\tFCE5C9C1-CF1F-4593-9C23-B2D2F886DDC9\\tApex Matting & Foodservice Products\\tF3F7DFA6-BE0D-DD11-A23A-00304834A8C9\\t170S0035BD\\tFloor Mat, Carpet\\tOrientax\\tChicago\\tIL\\t60638\\t170 Orientrax\\x99 Nylon Mat, 3' x 5', twisted nylon fiber for moisture absorption, anti-slip backing, oriental design, burgundy\\t180.18\\t0.0\\t \\t0.0\\t \\t \\t1\\tea\\t \\t0.0\\t9.0\\t \\t0.0\\t60.0\\t36.0\\ttrue\\t0\\tfalse\\tfalse\\t29854E73-49D6-409A-8A48-A9264FAE5703\\t \\t \\t \\t \\t \\t \\t \\tfalse\\tlistPrice\\r", :expected_charset=>"UTF-8"}
Is it possible that the error is because I am passing Orientrax™ in one of the fields, which I have mapped in Elasticsearch as the "keyword" type? Does the keyword data type not accept superscripts like ™? If so, how may I fix this?
This error message comes from Logstash, not Elasticsearch. You're sending data that isn't UTF-8. Specifically, the ™ is in your case represented as the single byte 0x99 (decimal 153), which indicates that your data isn't UTF-8 but CP-1252 (Windows-1252); in UTF-8 the ™ character would be the three-byte sequence 0xE2 0x84 0xA2. Either reconfigure Logstash to expect CP-1252 (you can set the charset on your input's codec) or change the data so it conforms to UTF-8.
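For readers who want a concrete illustration, a minimal pipeline that declares the incoming charset on the input's codec might look like the sketch below. This is not the poster's setup; the stdin input and stdout output are placeholders, and only the charset setting is the point.

```
# Minimal illustration only; stdin/stdout are placeholders for the real input/output.
input {
  stdin {
    codec => plain {
      charset => "CP1252"   # interpret incoming bytes as Windows-1252 and convert them to UTF-8
    }
  }
}
output {
  stdout { codec => rubydebug }
}
```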
Thanks for your response. I changed the character encoding in the codec setting of the input plugin to "CP1252" and, as expected, it only pushed data that conformed to CP1252. However, I still need the rest of the data (which is plain UTF-8) to go through. Is it possible to allow data of both encodings?
OK, I was wrong previously. Setting the encoding to "CP1252" allows both the UTF-8 and the CP1252 data to get pushed, so it works perfectly now. Thank you! For anyone else's reference, my input config looks like this (Windows machine):
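(The config itself did not survive in this copy of the thread. What follows is a sketch of what such an input could look like; the file input, the path, and the sincedb setting are assumptions, and only the CP1252 charset comes from the poster's description.)

```
input {
  file {
    # Hypothetical path; the original post's path was not preserved.
    path => "C:/data/feed.tsv"
    start_position => "beginning"
    sincedb_path => "NUL"      # common choice on Windows to disable the sincedb file
    codec => plain {
      charset => "CP1252"      # the setting that resolved the encoding error
    }
  }
}
```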