Logstash Invalid Character for UTF-16/Unicode encoding

Hi,

Below is the Logstash file input configuration I use to read the logs generated by my Windows application:

input {
  file {
    path => ["D:/ELK/LoggerTestApp/Server/Logger/GetAssetPointerById/**/*.txt"]
    codec => plain { charset => "UTF-16" }
    sincedb_path => ["D:/ELK/logstash/since.db"]
    start_position => "beginning"
  }
}

My application is configured to write logs in Unicode/UTF-16 encoding only. Once the logs are read and shipped to Elasticsearch, I see an invalid character (�) in each log entry.

Please advise me on how to avoid these invalid characters.

Environment Details

  1. Operating System : Win 7, Win 2008 R2
  2. ElasticSearch : elasticsearch-2.3.4
  3. Logstash : logstash-2.3.4
  4. Kibana : kibana-4.5.3-windows

Thanks in Advance.

Did you resolve this?
Did you try a different charset, like charset => "ISO-8859-1"?
Or are you looking for Elasticsearch to handle the charset differently? See https://www.elastic.co/guide/en/elasticsearch/guide/current/unicode-normalization.html

Hi Matthew,

Thanks for your response.

We're still living with the invalid character issue.

I'll try the suggestions provided.


Hi,
Did you resolve this?

Was there any resolution here? I am currently dealing with the same situation.

The question mark in a black diamond (�) is the Unicode replacement character, U+FFFD. It is emitted when the UTF-16 -> UTF-8 character conversion fails.

The config snippet codec => plain { charset => "UTF-16" } tells Logstash: "Treat all text as UTF-16 and convert it to UTF-8."
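To make that conversion step concrete, here is a minimal Python sketch of the same idea. It is not Logstash code: the function name utf16_to_utf8 is made up for illustration, little-endian byte order is assumed (the usual layout for Windows "Unicode" files), and Python's errors="replace" stands in for whatever the real converter does on failure.

```python
# Sketch only: "treat the bytes as UTF-16, then re-encode as UTF-8".
# Assumption: little-endian input; errors="replace" substitutes U+FFFD
# wherever the bytes cannot be decoded.

def utf16_to_utf8(raw: bytes) -> bytes:
    text = raw.decode("utf-16-le", errors="replace")
    return text.encode("utf-8")

good = "log line".encode("utf-16-le")
damaged = good + b"\x00"  # odd trailing byte: not a complete 16-bit code unit

print(utf16_to_utf8(good))     # b'log line'
print(utf16_to_utf8(damaged))  # b'log line\xef\xbf\xbd'  (EF BF BD is U+FFFD in UTF-8)
```

A single stray byte is enough to produce the replacement character, so the � does not necessarily mean the application wrote bad text.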

The input may contain illegal surrogates (http://unicode.org/faq/utf_bom.html#utf16-7),
or the charset conversion library we use may not handle noncharacters (http://www.unicode.org/faq/private_use.html#noncharacters) very well.
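For anyone unsure what an "illegal surrogate" looks like in practice, the following Python sketch builds UTF-16 code units by hand and decodes them. The helper decode_units is hypothetical, and Python's decoder is only a stand-in here; the library Logstash uses may behave differently in the details.

```python
import struct

def decode_units(units):
    """Pack 16-bit code units as little-endian bytes and decode them as UTF-16."""
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le", errors="replace")

# A high surrogate followed by a low surrogate is a legal pair (here U+1F600).
valid_pair = decode_units([0xD83D, 0xDE00])

# A low surrogate with no preceding high surrogate is illegal on its own.
lone_low = decode_units([0x0041, 0xDE00, 0x0042])

print(ascii(valid_pair))
print(ascii(lone_low))
print("\ufffd" in lone_low)
```

If the log files contain unpaired surrogates like this, the replacement character is the expected result of a lenient decoder rather than a bug in it.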

Hi,

I had the same kind of problem: a "�" (black diamond or cube) at the end of each line after converting from UTF-16.

I don't know what the character was (carriage return, vertical tab,... - something like that) but I did not need it.
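One possible explanation, offered only as a guess: if a reader splits the raw bytes on the single byte 0x0A before decoding, a UTF-16LE newline (bytes 0A 00) is cut in half and every following chunk is misaligned by one byte. Whether Logstash 2.3.4 actually does this is not confirmed in this thread. The Python sketch below uses a hypothetical reader, split_then_decode, to show the effect.

```python
def split_then_decode(raw: bytes, encoding: str = "utf-16-le"):
    # Hypothetical reader: split on the byte 0x0A first, decode each chunk afterwards.
    chunks = [c for c in raw.split(b"\n") if c]
    return [c.decode(encoding, errors="replace") for c in chunks]

raw = "ab\r\ncd\r\n".encode("utf-16-le")

# Splitting first leaves the 0x00 half of each newline at the start of the
# next chunk, so later chunks have an odd length and decode badly.
broken = split_then_decode(raw)

# Decoding the whole buffer first and splitting the text afterwards is clean.
clean = raw.decode("utf-16-le").splitlines()

for line in broken:
    print(ascii(line))
print(clean)
```

In this sketch only the first line survives intact; the later chunks end in U+FFFD, which at least resembles a stray character at a line boundary.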

I worked around the problem by stripping the "�" from the message field:

mutate {
  gsub => ["message","�",""]
}
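For reference, here is roughly what that gsub does, sketched in Python rather than Logstash config. The function strip_replacement_chars is a made-up name; it also returns how many characters were removed, since stripping U+FFFD hides the conversion problem rather than fixing it.

```python
REPLACEMENT = "\ufffd"  # the "�" character

def strip_replacement_chars(message: str):
    """Remove every U+FFFD from the message and report how many were dropped."""
    removed = message.count(REPLACEMENT)
    return message.replace(REPLACEMENT, ""), removed

cleaned, removed = strip_replacement_chars("2016-07-20 request handled\ufffd")
print(cleaned)  # 2016-07-20 request handled
print(removed)  # 1
```

Keeping an eye on the removed count is a cheap way to notice if real data, and not just a line terminator, starts getting lost.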