Logstash 7.14.0 "invalid byte sequence in UTF-8" in logstash.javapipeline

Aha, I'm closer now, and it turns out that it is my fault.

Why? The input this message arrived on was a custom Kafka AVRO codec, which I forked from another plugin (because of schema registry support). If the same message is read in using, say, the 'line' codec, those characters get replaced with the Unicode replacement character instead of blowing up the pipeline.
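As a standalone illustration of that kind of replacement (plain Ruby, not the exact Logstash code path), 'String#scrub' swaps bytes that are invalid in the string's declared encoding for the replacement character U+FFFD:

```ruby
# 0xE9 is 'é' in Latin-1, but on its own it is not a valid byte
# sequence in UTF-8 -- so this string claims UTF-8 yet is malformed.
raw = "caf\xE9".force_encoding("UTF-8")

puts raw.valid_encoding?   # false
puts raw.scrub             # "caf�" -- the invalid byte becomes U+FFFD
```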

This is because of the following code in the 'line' codec's line.rb:

  def register
    require "logstash/util/buftok"
    @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
    @converter = LogStash::Util::Charset.new(@charset)           # THIS LINE
    @converter.logger = @logger
  end

  def decode(data)
    @buffer.extract(data).each { |line| yield LogStash::Event.new(MESSAGE_FIELD => @converter.convert(line)) }  # USED HERE
  end
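To make the flow above concrete, here is a rough, dependency-free sketch of what register/decode do: buffer incoming chunks, split on the delimiter, and run each complete line through the converter before emitting it. FileWatch::BufferedTokenizer and LogStash::Event are stubbed out with plain Ruby here, so this is an approximation, not the real implementation:

```ruby
buffer  = +""
# Stand-in for Charset#convert: escape the line if its bytes are invalid.
convert = ->(line) { line.valid_encoding? ? line : line.inspect[1..-2] }

# Stand-in for BufferedTokenizer + decode: accumulate data, emit each
# complete delimiter-terminated line.
decode = lambda do |data, &emit|
  buffer << data
  while (idx = buffer.index("\n"))
    line = buffer.slice!(0..idx).chomp("\n")
    emit.call(convert.call(line))
  end
end

decode.call("first\nsec") { |msg| puts msg }   # prints "first"
decode.call("ond\n")      { |msg| puts msg }   # prints "second"
```

Note that a partial line ("sec") stays in the buffer until the delimiter for it arrives, which is exactly why the tokenizer exists.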

And in LogStash::Util::Charset, inside the convert method, we see:

    unless data.valid_encoding?
      return data.inspect[1..-2].tap do |escaped|
        @logger.warn("Received an event that has a different character encoding than you configured.", :text => escaped, :expected_charset => @charset)
      end
    end

'valid_encoding?' is a standard Ruby String method that returns false when a string's bytes are not valid for its declared encoding.
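A quick check of both behaviours in plain Ruby (nothing Logstash-specific):

```ruby
good = "hello"
# 0xFF can never appear in well-formed UTF-8.
bad  = "hell\xFF".force_encoding("UTF-8")

puts good.valid_encoding?   # true
puts bad.valid_encoding?    # false

# The inspect[1..-2] trick from Charset#convert: escape the bad bytes
# instead of raising, producing a loggable plain-ASCII string.
puts bad.inspect[1..-2]     # hell\xFF
```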

Aha! So the moral of the story is: if you maintain a codec, please make sure it guards against encoding errors like this, for example by running incoming data through LogStash::Util::Charset the way the stock codecs do. It would also be useful if the plugin development docs listed this as something to check for.
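Such a guard could look like this sketch ('safe_decode' is a hypothetical helper, and scrubbing to U+FFFD is just one policy; delegating to LogStash::Util::Charset, as the stock codecs do, is another):

```ruby
# Hypothetical guard: ensure the payload is valid UTF-8 before it is
# turned into an event, replacing any bad bytes with U+FFFD instead of
# letting them break the pipeline further downstream.
def safe_decode(data)
  text = data.dup.force_encoding(Encoding::UTF_8)
  text = text.scrub unless text.valid_encoding?
  yield text
end

safe_decode("ok\xFFoops") { |msg| puts msg }   # prints "ok�oops"
```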