Hello and thank you in advance for your help.
I can't seem to get the UTF-8 encoding for my Logstash output right. It always looks as if UTF-8 data has been interpreted as ISO-8859-1. My data sources have been a MySQL database (queried via the JDBC input), a PHP script with JSON output (loaded with the HTTP poller), and stdin (for testing purposes). There is one common denominator: the ES output looks good; MySQL and stdout don't.
Let's take "fühlen" as an example: it's shown as "fühlen" in Kibana, but becomes "fühlen" in my database (JDBC output plugin) and in the plain output (stdout). I tried the following test configuration:
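(Reconstructed here in simplified form; the jdbc output connection settings are placeholders for my real database:)

input {
  stdin { }                                # I type "fühlen" here
}
filter {
  mutate {
    add_field => {
      "field_a" => "fühlen"                # literal taken from the config file itself
      "field_b" => "%{message}"            # copied from the stdin message
    }
  }
}
output {
  stdout { codec => plain }
  jdbc {
    # placeholder connection settings
    connection_string => "jdbc:mysql://localhost:3306/test?user=logstash&password=secret"
    statement => [ "INSERT INTO test (field_a, field_b) VALUES (?, ?)", "field_a", "field_b" ]
  }
}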
Result:
stdout: 2018-04-04T09:58:13.247Z elasticsearch-vm fühlen
MySQL field_a: fühlen
MySQL field_b: fühlen
If I add another field with 'add_field => { "message2" => "fühlen" }' and try a rubydebug output, I get:
{
       "message" => "f\xC3\xBChlen",
    "@timestamp" => 2018-04-04T10:15:09.669Z,
      "message2" => "fühlen",
          "host" => "elasticsearch-vm",
      "@version" => "1"
}
Does anyone have an idea what I am doing wrong? I'm still hoping that it's easy to fix and I'm missing the forest for the trees. Charsets will be the death of me ...
Best regards,
Jennifer
Not actually being helpful and answering your question here, but
Charsets will be the death of me ...
the story is that there's a question on the Cambridge University computer science exam that goes something like "explain why even experienced programmers have problems with character sets". They've been asking this question since the 1960s and look forward each year to seeing what the students will come up with this time (Once Upon A Time it was five-track paper tape escape codes, in my day it was ASCII vs EBCDIC, these days it might be Chinese email subject lines ...).
Thank you. It's a little bit comforting that I am not the only lonely, desperate soul in this fight. Nevertheless I am still hoping for a knight in shining armour who will save me from these ugly characters.
Can anybody help?
So my knight is wearing sunglasses, very contemporary
Thank you very much for taking the time to deal with my problem. I really appreciate it.
And thank you for the "locale charmap" tip. That's good to know. Buuut:
I used the same code for my http poller pipeline and was able to import this into the MySQL database:
Échame La Culpa
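(The code in question, as far as I can reconstruct it here, is a ruby filter along these lines; "message" is a placeholder for whichever field needs fixing:)

filter {
  ruby {
    code => '
      # the bytes coming in are already valid UTF-8 but are tagged as BINARY/ASCII-8BIT;
      # force_encoding only relabels the string, it does not change any bytes
      msg = event.get("message")
      event.set("message", msg.force_encoding("UTF-8")) if msg.is_a?(String)
    '
  }
}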
Now I'll just have to wait a few hours until a potentially problematic entry occurs in the other database. If that can be imported successfully, I'm quite happy. It seems like a cumbersome solution for a problem I shouldn't even have. But at least now I'll have nice data.
If someone has a better idea, feel free to post!
UPDATE: "Immer noch fühlen" has been imported from the other database successfully!
Aaaaaaand the story continues. And 'it's kind of a funny story': The previous database table was latin1_swedish_ci. There I managed to get correct UTF-8 strings. NOW I am querying a utf8_general_ci table and all I get is gibberish. I tried to solve the issue with Ruby, converting back and forth between UTF-8, UTF-16, ISO-8859-1 etc., but I can't get it to work for the life of me.
There aren't that many entries like that, but I need to get rid of artists like "Herbert Grönemeyer" ...
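(What I've been trying is essentially variations on this ruby filter, without success so far; "artist" is a placeholder field name:)

filter {
  ruby {
    code => '
      # in theory this should undo double-encoded UTF-8: re-encode the characters
      # back to their Latin-1 bytes, then relabel those bytes as UTF-8
      s = event.get("artist")
      event.set("artist", s.encode("ISO-8859-1").force_encoding("UTF-8")) if s.is_a?(String)
    '
  }
}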
I would be really happy if someone a) knows an easy solution or b) could help me install the charlock_holmes Ruby gem or any other lib that might save me. I tried

env GEM_HOME=/opt/logstash/vendor/bundle/jruby/1.9 /usr/share/logstash/vendor/jruby/bin/jruby /usr/share/logstash/vendor/jruby/bin/gem install charlock_holmes
as well as installing a bunch of dependencies that were mentioned in different forums (libgmp-dev, libgmp3-dev, libpcre3, libpcre3-dev, git-core, curl, zlib1g-dev, build-essential, libssl-dev, libreadline-dev, libyaml-dev, libsqlite3-dev, sqlite3, libxml2-dev, libxslt1-dev, libcurl4-openssl-dev, python-software-properties, libffi-dev, make, libcurl4-gnutls-dev). But it always says:
RuntimeError: The compiler failed to generate an executable file. You have to install development tools first.
Update 2: Okay, I have installed it to /usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/ and added gem "rchardet" to the Gemfile. A quick test with stdin shows that it is doing something...
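(The quick test is just a ruby filter that dumps whatever rchardet detects into a field; sketch:)

filter {
  ruby {
    init => 'require "rchardet"'
    code => '
      msg = event.get("message")
      if msg.is_a?(String)
        cd = CharDet.detect(msg)   # returns e.g. {"encoding" => "utf-8", "confidence" => 0.99}
        event.set("detected_encoding", cd["encoding"])
      end
    '
  }
}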
Ooookay... I think I know what's wrong. It seems like the data is wrong in the original database: it's in a utf8_general_ci column, but it wasn't imported over a UTF-8 connection. The string really is 'Grönemeyer' in the table.
The data is usually used in PHP with a latin1 connection. The wrong import and the latin1 query seem to compensate for each other, because the strings on our website are correct. When I run a "SET NAMES 'utf8'" query at the start of those PHP scripts, I get the same problems. But I can't change the data, and I am not able to reproduce the same effect in Logstash to get valid strings.
If anyone has an idea how I could still retrieve correct data from this source, I would be very glad. But I'll probably have to set up a PHP script as an intermediate step and use an http_poller, roughly as sketched below.
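(Something like this; the URL is a placeholder, and the PHP script would read the table over a latin1 connection and emit clean UTF-8 JSON:)

input {
  http_poller {
    urls => {
      tracks => "http://intranet.example/export_tracks.php"   # placeholder endpoint
    }
    schedule => { every => "1h" }
    codec => "json"
  }
}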
FWIW, charlock_holmes uses the C library from IBM's ICU project under the hood. IBM released a Java library with the same features. I did an experiment with the Java code but was unconvinced of the overall usefulness of its universal charset detection, mainly (IIRC) because we have such small texts to run detection on. I think they intended detection to be done on complete web pages or Word documents.