Aha, I'm closer now, and it turns out that it is my fault.
Why? The input this message arrived on was a custom Kafka Avro input plugin, which I forked from another (for schema registry support). If the same message is read in using, say, the 'line' codec, those characters get replaced with the Unicode replacement character.
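As an aside, that replacement behavior is easy to reproduce in plain Ruby with String#scrub, which swaps invalid bytes for U+FFFD (a minimal sketch, not the Logstash code itself):

    # \xC3 starts a two-byte UTF-8 sequence, but "(" can't complete it,
    # so the string is not valid UTF-8.
    bytes = "caf\xC3(".dup.force_encoding(Encoding::UTF_8)

    bytes.valid_encoding?   # => false

    # String#scrub replaces each invalid byte with U+FFFD, the Unicode
    # replacement character.
    bytes.scrub             # => "caf\uFFFD("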
This is because of the following code in line.rb:
    def register
      require "logstash/util/buftok"
      @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
      @converter = LogStash::Util::Charset.new(@charset) # THIS LINE
      @converter.logger = @logger
    end

    def decode(data)
      @buffer.extract(data).each { |line| yield LogStash::Event.new(MESSAGE_FIELD => @converter.convert(line)) } # USED HERE
    end
And in LogStash::Util::Charset, inside the convert method, we see:
    unless data.valid_encoding?
      return data.inspect[1..-2].tap do |escaped|
        @logger.warn("Received an event that has a different character encoding than you configured.", :text => escaped, :expected_charset => @charset)
      end
    end
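To see what that fallback actually produces: String#inspect renders invalid bytes as escape sequences and wraps the result in quotes, and the [1..-2] slice strips those surrounding quotes. A quick illustration:

    data = "caf\xC3(".dup.force_encoding(Encoding::UTF_8)

    # inspect escapes the invalid byte and quotes the whole string;
    # [1..-2] drops the leading and trailing quote characters.
    data.inspect          # => "\"caf\\xC3(\""
    data.inspect[1..-2]   # => "caf\\xC3("

So instead of raw invalid bytes, downstream consumers receive an ASCII-safe escaped rendering of the event.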
'valid_encoding?' is a standard Ruby String method that returns false if the string's bytes are not a valid sequence for its declared encoding.
Aha! So the moral of the story: if you maintain any codecs, please make sure you guard against such encoding errors. It would be useful if that were listed as something to check for in the plugin development docs.
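For what it's worth, one possible guard looks like the sketch below (sanitize is a hypothetical helper, not a Logstash API; it uses String#scrub, so the exact replacement string depends on the target encoding):

    # Force the expected charset, then scrub invalid bytes instead of
    # letting them propagate downstream or raise mid-pipeline.
    def sanitize(data, charset = Encoding::UTF_8)
      text = data.dup.force_encoding(charset)
      return text if text.valid_encoding?
      # Replace each invalid byte with U+FFFD for Unicode encodings.
      text.scrub
    end

    sanitize("ok")            # => "ok"
    sanitize("bad\xFFbyte")   # => "bad\uFFFDbyte"

Whether you scrub, escape (as LogStash::Util::Charset does), or drop the event is a policy choice; the point is to decide deliberately rather than pass invalid bytes through.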