Logstash 7.14.0 "invalid byte sequence in UTF-8" in logstash.javapipeline

Came into work today and noticed that our Logstash pipeline (on both hosts) had halted with the following error in the logs.

Aug 11 00:22:40 its-elk-p03.uod.otago.ac.nz logstash[20536]: [2021-08-11T00:22:40,141][ERROR][logstash.javapipeline ][main] Pipeline worker error, the pipeline will be stopped {:pipeline_id=>"main", :error=>"(ArgumentError) invalid byte sequence in UTF-8", :exception=>Java::OrgJrubyExceptions::ArgumentError, :backtrace=>["org.jruby.RubyRegexp.match?(org/jruby/RubyRegexp.java:1180)", "usr.share.logstash.logstash_minus_core.lib.logstash.java_pipeline.start_workers(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:295)"], :thread=>"#<Thread:0xf1261f4 sleep>"}

The other server suffered the same error at 00:22:39.

There don't appear to be any other useful log entries preceding that.

The error occurred very soon after midnight, and we had just put in place the default logstash log4j2.properties file (it was missing from our deployment, so previously everything went to stderr, which is captured by journald).

We have a number of inputs that Logstash reads from, but they are all Kafka topics falling largely into two groups (one is a plain JSON topic, the other is Avro). We restarted Logstash and the same error occurred.

We tried to isolate the error by using kafka-console-consumer etc. to see what messages might have caused it, but all messages appear to be valid JSON.

We changed the logging level of Logstash to DEBUG.

We further tried disabling individual inputs so that Logstash was reading from just one at a time (still with DEBUG logging)... we have yet to see the error return.

I've run Logstash normally (i.e. with both inputs) with DEBUG logging and with the default (INFO) logging, and it has been running fine, so I'm thinking it's due to an issue with log rotation.

I'm gonna babysit it tonight and see if the same occurs.

Question: would configuring Logstash to use its Dead-Letter-Queue functionality help with this? (It's on the plan to implement, just wondering if I should bump it up).

Think I stumbled upon the root cause; there are actually other inputs (which I forgot about because I was overly focussed on Kafka).

One such input was a 'pipe' input running a custom program that reads a funky logging protocol (Cisco ISE fragments huge log messages over multiple syslog messages). We had recently upgraded ISE, which presumably means we have a bug to fix, because Elasticsearch told us that the index it was writing to had reached the limit of 1000 fields... for some reason the generated JSON was emitting various dates as JSON keys, so presumably there was some funny business going on and some message ended up being generated that was not valid...

... this meant that the following Logstash code got handed non-UTF-8 input. I've since added skip_invalid_inputs (because I'm now on 7.14), though I don't know if that will prevent this trigger condition from crashing the pipeline.

filter {
  if [message] =~ /^\s*{/ {
    json {
      id => "parse-json-in-message.json.38"
      source => "message"
      target => "m"
      remove_field => "message"
      skip_on_invalid_json => true
    }
  }
}
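To make the failure mode concrete, here's a minimal Ruby sketch (my own illustration, not Logstash code) of why the conditional itself blows up: matching a regex against a String that is tagged as UTF-8 but contains invalid bytes raises the same ArgumentError seen in the backtrace (RubyRegexp.match?).

    # Hypothetical message field carrying the raw attack bytes.
    msg = "/theme/META-INF/\xC0\xAE\xC0\xAE/etc/passwd".dup.force_encoding("UTF-8")

    msg.valid_encoding?        # => false

    begin
      # Mirrors the `if [message] =~ /^\s*{/` conditional in the filter above.
      msg.match?(/^\s*\{/)
    rescue ArgumentError => e
      puts e.message           # => "invalid byte sequence in UTF-8"
    end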

Apparently this was not the [only] cause because the problem resurfaced today, at about 8am sharpish.

[2021-08-11T13:04:18,744][ERROR][logstash.javapipeline    ][main] Pipeline worker error, the pipeline will be stopped {:pipeline_id=>"main", :error=>"(ArgumentError) invalid byte sequence in UTF-8", :exception=>Java::OrgJrubyExceptions::ArgumentError, :backtrace=>["org.jruby.RubyRegexp.match?(org/jruby/RubyRegexp.java:1180)", "usr.share.logstash.logstash_minus_core.lib.logstash.java_pipeline.start_workers(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:295)"], :thread=>"#<Thread:0x2230aabb sleep>"}
[2021-08-11T13:04:18,749][INFO ][filewatch.observingread  ][main] QUIT - closing all files and shutting down.
[2021-08-11T13:04:18,750][WARN ][logstash.javapipeline    ][main] Waiting for input plugin to close {:pipeline_id=>"main", :thread=>"#<Thread:0x2230aabb run>"}

I've restarted Logstash with debug-level logging in the hope that it shows the issue... and it seems to have recurred, but without anything useful in the logs.

Think I found the culprit: a web log that was 'JSON' but not proper JSON. When displayed on-screen I get Unicode replacement characters in the URI part of the log.

The raw log, from Nginx, was sent via Filebeat and contains the following (as viewed in 'less', so the non-text characters are escaped):

"uri": "/theme/META-INF/<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/
<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/<C0><AE><C0><AE>/etc/passwd"

It's a reasonably common path traversal attack that uses Unicode encodings (C0 AE is an encoding of '.', so it's trying to do a '../../../..' style of thing), and I would not expect Logstash to crash the pipeline because of untrusted user-provided input.

Note that the byte sequence C0 AE, while it decodes arithmetically to '.', is an illegal UTF-8 sequence because it fails the shortest-form requirement: UTF-8 forbids overlong encodings.
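Here's a quick Ruby illustration (again my own, just to show the arithmetic) of why a conforming decoder must refuse the sequence even though it 'works out' to the same code point:

    # '.' (U+002E) encodes to the single byte 0x2E; that is its only legal form.
    ".".bytes.map { |b| format("%02X", b) }          # => ["2E"]

    # The overlong two-byte form C0 AE decodes arithmetically to the same code
    # point ((0xC0 & 0x1F) << 6 | (0xAE & 0x3F) == 0x2E), but UTF-8 forbids
    # non-shortest-form encodings, so decoders reject it instead of yielding '.'.
    overlong = "\xC0\xAE".dup.force_encoding("UTF-8")
    overlong.valid_encoding?                         # => false
    overlong.encode("UTF-16") rescue puts $!.class   # => Encoding::InvalidByteSequenceError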

The log from Nginx is configured with the following (I log a rich set of information):

    # escape=json requires Nginx version 1.11.8 or later
    log_format uoo_logstash_json escape=json
        '{ "@timestamp": "$time_iso8601", '
...
        '"uri": "$uri", '
...
        '}';

See RFC 3629 (UTF-8, a transformation format of ISO 10646), Section 10, Security Considerations:

Implementers of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.

[...] Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
actually been used in a widespread virus attacking Web servers in
2001; thus, the security threat is very real.

I think JRuby or Logstash needs to have a look at this, and various feeder systems need to detect this too.... in the meantime, how to harden Logstash against this?

Aha, I'm closer now, and it turns out that it is my fault.

Why? The input this message came in on was a custom Kafka Avro codec, which I forked from another (because of schema registry support). If the same message were read in using, say, the 'line' codec, then those characters would get replaced with the Unicode replacement character.

This is because of the following code in line.rb

  def register
    require "logstash/util/buftok"
    @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
    @converter = LogStash::Util::Charset.new(@charset)           # THIS LINE
    @converter.logger = @logger
  end

  def decode(data)
    @buffer.extract(data).each { |line| yield LogStash::Event.new(MESSAGE_FIELD => @converter.convert(line)) }  # USED HERE
  end

And in LogStash::Util::Charset we see inside the convert method:

    unless data.valid_encoding?
      return data.inspect[1..-2].tap do |escaped|
        @logger.warn("Received an event that has a different character encoding than you configured.", :text => escaped, :expected_charset => @charset)
      end
    end

'valid_encoding?' is a standard Ruby String method that returns false if a string's bytes are not valid for its declared encoding.
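To illustrate (my own sketch, not the plugin's code), here is roughly what that escape branch does to the attack string from earlier: #inspect renders the invalid bytes as literal backslash escapes, so the result is plain ASCII and safe to run regexes over.

    raw = "/theme/META-INF/\xC0\xAE\xC0\xAE/etc/passwd".dup.force_encoding("UTF-8")

    raw.valid_encoding?            # => false

    escaped = raw.inspect[1..-2]   # drop the surrounding quotes that #inspect adds
    # escaped is now the literal text /theme/META-INF/\xC0\xAE\xC0\xAE/etc/passwd
    escaped.valid_encoding?        # => true; downstream regex matches no longer raise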

Aha! So the moral of the story is: if you're a maintainer of any codec, please ensure that you guard against such encoding errors. It would be useful if that were listed as something to check for in the plugin development docs.
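For anyone maintaining a codec and wanting to add such a guard, here is a hypothetical sketch (names and structure are mine, not any actual plugin) of sanitising the payload before it can reach regex-based filters, using String#scrub:

    # Hypothetical decode method for a custom codec: normalise the payload to
    # valid UTF-8 before handing it to the rest of the pipeline.
    def decode(data)
      data = data.dup.force_encoding("UTF-8")
      # String#scrub replaces any invalid byte sequences with U+FFFD, so the
      # event can never crash a downstream regex with ArgumentError.
      data = data.scrub("\uFFFD") unless data.valid_encoding?
      yield LogStash::Event.new("message" => data)
    end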
