Charset not considered on HTTP input plugin


Before opening a ticket I would like to discuss my issue here.

I receive an event.message string encoded in UTF-8 even though I specify codec => plain { charset => "ASCII-8BIT" } in my config.

Here is my config:

input {
  http {
    host => ""
    port => 8080
    codec => plain { charset => "ASCII-8BIT" }
  }
}

filter {
  example {
  }
}

output {
  stdout { codec => rubydebug }
}

Here is an extract of my custom filter:

def filter(event)
  @logger.debug? && @logger.debug("The event.message size is: #{event.get("message").size}")
  @logger.debug? && @logger.debug("The event.message encoding is: #{event.get("message").encoding}")

  counter = 0
  event.get("message").each_byte { |c|
    # Increments the counter for each byte within the string
    counter += 1
  }
  @logger.debug? && @logger.debug("There are #{counter} bytes in the string")

  # filter_matched should go in the last line of our successful code
  filter_matched(event)
end # def filter

And here is the output (I expected "The event.message encoding is: ASCII-8BIT"):

filter received {:event=>{"message"=>"H\u0000\u0002\u0001\a\u0000\u0000\u0000\u0001\u0002\u0003\u0004\u0005C&\v\u0000\u0000\u0000\u0000\u0000\u0000�F\u0000\u0000"\v\u0000", "@version"=>"1", "@timestamp"=>"2016-10-11T23:32:52.277Z", "host"=>"", "headers"=>{"request_method"=>"POST", "request_path"=>"/", "request_uri"=>"/", "http_version"=>"HTTP/1.1", "http_user_agent"=>"Mozilla/4.0 (compatible; AP:FiOS-Mercury/; PL:Motorola-DCT/KA15.76.12.19AlderF.560; BX:VMS1100; UA:0000108336906021; U; en-US)", "http_host"=>"", "http_accept"=>"/", "content_type"=>"application/x-www-form-urlencoded", "content_length"=>"2880"}}, :level=>:debug, :file=>"(eval)", :line=>"41", :method=>"filter_func"}
The event.message size is: 2880 {:level=>:debug, :file=>"logstash/filters/example.rb", :line=>"18", :method=>"filter"}
The event.message encoding is: UTF-8 {:level=>:debug, :file=>"logstash/filters/example.rb", :line=>"19", :method=>"filter"}
There are 2898 bytes in the string {:level=>:debug, :file=>"logstash/filters/example.rb", :line=>"27", :method=>"filter"}

I just figured out this is actually a feature implemented in LogStash::Util::Charset:

def convert(data)
  # NON UTF-8 charset declared.
  # Let's convert it (as cleanly as possible) into UTF-8 so we can use it with JSON, etc.
  return data.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace) unless @charset_encoding == Encoding::UTF_8
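This conversion also explains why the filter counts 2898 bytes in a 2880-byte payload: every byte that is invalid or unmappable in UTF-8 is replaced by U+FFFD, which occupies three bytes, so the string grows. A minimal sketch of the same encode call in plain Ruby, with made-up byte values:

```ruby
# A 3-byte binary string; \xEF on its own has no UTF-8 mapping.
raw = "H\xEF\x00".b                 # .b forces ASCII-8BIT (binary) encoding
raw.bytesize                        # => 3

# Same call LogStash::Util::Charset#convert makes:
utf8 = raw.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace)
utf8.encoding                       # => #<Encoding:UTF-8>
utf8.bytesize                       # => 5 ("\xEF" became the 3-byte U+FFFD)
```

So the original bytes are not preserved; the replacement is lossy, which is why inspecting the message at filter level no longer shows the raw payload.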

I might be missing something, but it would be great if we could specify some kind of 'keep_original_charset' option; this would allow handling arbitrary binary protocols at the filter level.

Just in case someone hits a similar problem: I solved the issue by coding a custom codec. Inside its decode method you have the data in its original encoding, and you can do pretty much whatever you want with it (parse it, create a string array of hex values, ...).

Here is an extract:

def decode(data)
  array_data = data.unpack('C*')

  header_char = array_data.shift(1).pack('C*')                          #1 My first byte
  header_version_number = getnumber_frombytes(array_data.shift(1))      #2 My second byte
  header_platform_id_number = getnumber_frombytes(array_data.shift(1))  #3 My third byte
  header_isextended_number = getnumber_frombytes(array_data.shift(1))   #4 My fourth byte
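The unpack/shift/pack pattern above can be tried in plain Ruby outside of Logstash. In this sketch, getnumber_frombytes is a hypothetical stand-in for the helper used in the extract, assumed to read a one-element byte array as an Integer, and the header bytes are made up:

```ruby
# Hypothetical stand-in for the helper in the extract above.
def getnumber_frombytes(bytes)
  bytes.first
end

data = "H\x02\x01\x07".b               # made-up 4-byte binary header
array_data = data.unpack('C*')         # => [72, 2, 1, 7]

header_char = array_data.shift(1).pack('C*')                      # => "H"
header_version_number = getnumber_frombytes(array_data.shift(1))  # => 2
```

Because decode receives the raw bytes before any charset conversion, unpack('C*') sees the payload exactly as it arrived on the wire.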