Aha, I'm closer now, and it turns out that it is my fault.
Why? The input this message arrived on was a custom Kafka Avro input plugin, which I forked from another (for schema registry support). If the same message is read in using, say, the 'line' codec, those characters get replaced with the Unicode replacement character.
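As an aside, that replacement behavior is easy to reproduce in plain Ruby with String#scrub, which swaps invalid bytes for U+FFFD (a minimal sketch, not the Logstash code itself):

    # \xC3 starts a two-byte UTF-8 sequence, but "(" can't complete it,
    # so the string is not valid UTF-8.
    bytes = "caf\xC3(".dup.force_encoding(Encoding::UTF_8)

    bytes.valid_encoding?   # => false

    # String#scrub replaces each invalid byte with U+FFFD, the Unicode
    # replacement character.
    bytes.scrub             # => "caf\uFFFD("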
This is because of the following code in line.rb:
    def register
      require "logstash/util/buftok"
      @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
      @converter = LogStash::Util::Charset.new(@charset) # THIS LINE
      @converter.logger = @logger
    end

    def decode(data)
      @buffer.extract(data).each { |line| yield LogStash::Event.new(MESSAGE_FIELD => @converter.convert(line)) } # USED HERE
    end
And in LogStash::Util::Charset, inside the convert method, we see:
    unless data.valid_encoding?
      return data.inspect[1..-2].tap do |escaped|
        @logger.warn("Received an event that has a different character encoding than you configured.", :text => escaped, :expected_charset => @charset)
      end
    end
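To see what that fallback actually produces: String#inspect renders invalid bytes as escape sequences and wraps the result in quotes, and the [1..-2] slice strips those surrounding quotes. A quick illustration:

    data = "caf\xC3(".dup.force_encoding(Encoding::UTF_8)

    # inspect escapes the invalid byte and quotes the whole string;
    # [1..-2] drops the leading and trailing quote characters.
    data.inspect          # => "\"caf\\xC3(\""
    data.inspect[1..-2]   # => "caf\\xC3("

So instead of raw invalid bytes, downstream consumers receive an ASCII-safe escaped rendering of the event.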
'valid_encoding?' is a standard Ruby String method that returns false if the string's bytes are not a valid sequence for its declared encoding.
Aha! So the moral of the story: if you maintain any codecs, please make sure you guard against such encoding errors. It would be useful if that were listed as something to check for in the plugin development docs.
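For what it's worth, one possible guard looks like the sketch below (sanitize is a hypothetical helper, not a Logstash API; it uses String#scrub, so the exact replacement string depends on the target encoding):

    # Force the expected charset, then scrub invalid bytes instead of
    # letting them propagate downstream or raise mid-pipeline.
    def sanitize(data, charset = Encoding::UTF_8)
      text = data.dup.force_encoding(charset)
      return text if text.valid_encoding?
      # Replace each invalid byte with U+FFFD for Unicode encodings.
      text.scrub
    end

    sanitize("ok")            # => "ok"
    sanitize("bad\xFFbyte")   # => "bad\uFFFDbyte"

Whether you scrub, escape (as LogStash::Util::Charset does), or drop the event is a policy choice; the point is to decide deliberately rather than pass invalid bytes through.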