Character encoding problems

Hello and thank you in advance for your help.
I can't seem to get the UTF-8 encoding for my Logstash output right. It always looks as if UTF-8 data has been interpreted as ISO-8859-1. My data sources have been a MySQL database (queried via the JDBC input), a PHP script with JSON output (loaded with the HTTP poller) and stdin (for testing purposes). There is one common denominator: the Elasticsearch output looks good, while MySQL and stdout don't.
Let's take "fühlen" as an example: It's shown as "fühlen" in Kibana, but becomes "fühlen" in my database (JDBC output plugin) and the plain output (stdout). I tried the following test configuration:

input {
  stdin {
    codec => plain { charset => "UTF-8" }
  }
}
output {
  stdout { codec => plain }
  jdbc {
    driver_class => "com.mysql.jdbc.Driver"
    connection_string => "jdbc:mysql://127.0.0.1/...&useUnicode=yes&characterEncoding=UTF-8"
    statement => [ "INSERT INTO mytable (field_a, field_b) VALUES( 'fühlen', ?)", "message" ]
    codec => plain { charset => "UTF-8" }
  }
}

Input:
fühlen

Result:
stdout: 2018-04-04T09:58:13.247Z elasticsearch-vm fühlen
MySQL field_a: fühlen
MySQL field_b: fühlen

If I add another field with 'add_field => { "message2" => "fühlen" }' and try a rubydebug output, I get:
{
       "message" => "f\xC3\xBChlen",
    "@timestamp" => 2018-04-04T10:15:09.669Z,
      "message2" => "fühlen",
          "host" => "elasticsearch-vm",
      "@version" => "1"
}

Does anyone have an idea what I am doing wrong? I'm still hoping that it's easy to fix and that I just can't see the forest for the trees. Charsets will be the death of me ...
Best regards,
Jennifer

Not actually being helpful and answering your question here, but

Charsets will be the death of me ...

the story is that there's a question on the Cambridge University computer science exam that goes something like "explain why even experienced programmers have problems with character sets". They've been asking this question since the 1960s and look forward each year to seeing what the students will come up with this time (Once Upon A Time it was five-track paper tape escape codes, in my day it was ASCII vs EBCDIC, these days it might be Chinese email subject lines ...).

Thank you. It's a little bit comforting that I am not the only lonely, desperate soul in this fight. Nevertheless, I am still hoping for a knight in shining armour who will save me from these ugly characters.
Can anybody help?

One thing to note: the codec charset directive is not a "to" operation, it is a "from" operation. In Logstash, the "to" charset is always UTF-8.

Logstash does not have universal charset detection, so it needs to know what charset the incoming strings are encoded in before it can convert them to UTF-8.
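
For example, if you know that the incoming bytes are actually latin1, you declare that on the input and Logstash converts them to UTF-8 from there. A minimal sketch:

input {
  stdin {
    # charset declares what the bytes ARE (the "from" side);
    # Logstash converts everything to UTF-8 internally
    codec => plain { charset => "ISO-8859-1" }
  }
}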

When I run your config without the jdbc output I get:

input {
  stdin {
    codec => plain { charset => "UTF-8" }
  }
}
output {
  stdout { codec => rubydebug }
}

result:

{
      "@version" => "1",
          "host" => "Elastics-MacBook-Pro.local",
    "@timestamp" => 2018-04-05T20:50:16.143Z,
       "message" => "fühlen"
}

From this SO post: you could try "locale charmap" to see what charset the VM is using, or, for a more realistic test, use the exec input.

input {
  exec {
    command => "locale charmap"
    interval => 3600
  }
}
output {
  stdout { codec => rubydebug }
}

So my knight is wearing sunglasses, very contemporary :smiley:

Thank you very much for taking the time to deal with my problem. I really appreciate it.
And thank you for the "locale charmap" tip. That's good to know. Buuut:

"message" => "UTF-8\n"

And PHP code for the actual use case has

header('Content-Type: application/json; charset=utf-8');

So I have no idea what's happening there.
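
For reference, the charset can also be pinned on the poller input itself. A trimmed sketch of that pipeline (the name and URL are just placeholders):

input {
  http_poller {
    # placeholder name and URL; the real target is the PHP JSON export
    urls => { php_export => "http://example.org/export.php" }
    schedule => { every => "1h" }
    # declare the charset of the response body explicitly
    codec => json { charset => "UTF-8" }
  }
}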


On a lighter note: Filter Error ASCII-8BIT to UTF-8 - #2 by yaauie gave me a workaround that gives me hope.

ruby {
  code => '
    # force_encoding relabels the bytes as UTF-8 in place without
    # transcoding them, so mutating the strings returned by
    # event.to_hash also fixes the values stored in the event
    event.to_hash.each do |key, value|
      value.force_encoding("UTF-8") if value.is_a?(String)
    end
  '
}

It produces this rubydebug output:

{
    "@timestamp" => 2018-04-06T09:16:10.795Z,
          "host" => "elasticsearch-vm",
      "@version" => "1",
       "message" => "fühlen",
      "message2" => "fühlen"
}
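
As far as I understand it, force_encoding only relabels the bytes without transcoding them, which is exactly what's needed here: the bytes already are valid UTF-8, they were just tagged with the wrong encoding. A plain-Ruby sketch of the difference:

# the bytes C3 BC are the UTF-8 encoding of "ü", but start out mislabeled
s = "f\xC3\xBChlen".b            # encoding: ASCII-8BIT
s.force_encoding("UTF-8")        # same bytes, relabeled => "fühlen"

# encode, by contrast, actually transcodes the bytes
"fühlen".encode("ISO-8859-1")    # new byte sequence: f \xFC h l e n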

I used the same code for my http poller pipeline and was able to import this into the MySQL database:

Échame La Culpa

Now I'll just have to wait a few hours until a potentially problematic entry occurs in the other database. If that can be imported successfully, I'm quite happy. It seems like a cumbersome solution for a problem I shouldn't even have. But at least now I'll have nice data.

If someone has a better idea, feel free to post!


UPDATE: "Immer noch fühlen" has been imported from the other database successfully!

Aaaaaaand the story continues. And 'it's kind of a funny story': The previous database table was latin1_swedish_ci. There I managed to get correct UTF-8 strings. NOW I am querying a utf8_general_ci table and all I get is gibberish :tired_face: I tried to solve the issue with Ruby, converting back and forth between UTF-8, UTF-16, ISO-8859-1 etc., but I can't get it to work for the life of me.

There aren't that many entries like that, but I need to get rid of artists like "Herbert Grönemeyer" ...

I would be really happy if someone
a) knows an easy solution or
b) could help me install the charlock_holmes Ruby gem or any other lib that might save me. I tried
env GEM_HOME=/opt/logstash/vendor/bundle/jruby/1.9 /usr/share/logstash/vendor/jruby/bin/jruby /usr/share/logstash/vendor/jruby/bin/gem install charlock_holmes
as well as installing a bunch of dependencies that were mentioned in different forums (libgmp-dev, libgmp3-dev, libpcre3, libpcre3-dev, git-core, curl, zlib1g-dev, build-essential, libssl-dev, libreadline-dev, libyaml-dev, libsqlite3-dev, sqlite3, libxml2-dev, libxslt1-dev, libcurl4-openssl-dev, python-software-properties, libffi-dev, make, libcurl4-gnutls-dev). But it always says

RuntimeError: The compiler failed to generate an executable file. You have to install development tools first.

with the following log:

" -o conftest -I/include/universal-java1.8 -I/usr/share/logstash/vendor/jruby/lib/ruby/include/ruby/backward -I/usr/share/logstash/vendor/jruby/lib/ruby/include -I. -fPIC -fno-omit-frame-pointer -fno-strict-aliasing -fexceptions conftest.c -L. -L/usr/share/logstash/vendor/jruby/lib -m64"...

I've never worked with Ruby before.

I would be really thankful for any idea, any push into the right direction...
...Please?

The charlock_holmes library uses C for its internals and can't be used in JRuby (the Ruby engine that Logstash uses).

You might have better luck with this one: rchardet. It seems to be pure Ruby.

Aaah, thank you very much. I'll give that a try today.


First update: I was able to install it. That's something :smiley:

Fetching: rchardet-1.7.0.gem (100%)
Successfully installed rchardet-1.7.0
1 gem installed


Update 2: Okay, I have installed it to /usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/ and added gem "rchardet" to the Gemfile. A quick test with stdin shows that it is doing something...

ruby {
  init => "require 'rchardet'"
  code => "event.set('charset', CharDet.detect(event.get('message')))"
}

"charset" => {
"confidence" => 1.0,
"encoding" => "ascii"
}

Next step: Let's see what it has to say about my JDBC data.


Update 3: ...oh :confused:

"charset" => {
"confidence" => 0.7525,
"encoding" => "utf-8"
}

Ooookay... I think I know what's wrong. It seems like the data is wrong in the original database: it's in a utf8_general_ci column, but it wasn't imported over a UTF-8 connection. The string is in fact 'Grönemeyer' in the table.
The data is usually used in PHP over a latin1 connection. The wrong import and the wrong query seem to compensate for each other, because the strings on our website are correct. When I make a "SET NAMES 'utf8'" query at the start of those PHP scripts, I get the same problems. But I can't change the data, and I am not able to reproduce the same compensating effect in Logstash to get valid strings.
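
For reference, the round trips I've been attempting look roughly like this; just a sketch of one variant, the classic repair for a single layer of double encoding:

filter {
  ruby {
    code => '
      msg = event.get("message")
      if msg.is_a?(String)
        # reinterpret the mis-decoded string as latin1 bytes, then
        # relabel those bytes as the UTF-8 they originally were
        event.set("message", msg.encode("ISO-8859-1").force_encoding("UTF-8"))
      end
    '
  }
}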

If anyone has an idea how I could still retrieve correct data from this source, I would be very glad. But I'll probably have to set up a PHP script as an intermediate step and use the http_poller input.

@Jenni

FWIW, charlock_holmes uses the C library from IBM's ICU project under the hood. IBM also released a Java library with the same features. I did experiment with the Java code but was unconvinced of the overall usefulness of its universal charset detection, mainly (IIRC) because we have such small texts to detect on. I think they intended detection to be done on complete web pages or Word documents.

See this bug report as an example of what I mean.
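
For the curious: from JRuby that Java detector can be called directly. A rough sketch of the kind of experiment I mean, assuming the icu4j jar is on the classpath:

require "java"
java_import "com.ibm.icu.text.CharsetDetector"

detector = CharsetDetector.new
detector.setText("fühlen".to_java_bytes)  # detection works on raw bytes
match = detector.detect                   # best single guess
puts "#{match.getName} (confidence #{match.getConfidence})"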
