Logstash beats input - a lot of 'input: unhandled exception' errors

Hello, Logstash gurus,

I am using Logstash 2.3.2, filebeat 1.2.2 and logstash-beats-input 2.2.8.

I am sending logs with filebeat from 50 servers to a single Logstash server (using beats input).

On about half of the servers I am getting filebeat errors like the ones below:

2016/05/04 11:59:07.199854 single.go:76: INFO Error publishing events (retrying): EOF
2016/05/04 11:59:07.199918 single.go:152: INFO send fail
2016/05/04 11:59:07.201417 single.go:159: INFO backoff retry: 1s
2016/05/04 12:00:38.221792 single.go:76: INFO Error publishing events (retrying): read tcp 10.150.17.11:58739->10.125.32.30:5556: i/o timeout
2016/05/04 12:00:38.221839 single.go:152: INFO send fail
2016/05/04 12:00:38.221851 single.go:159: INFO backoff retry: 2s
2016/05/04 12:00:54.032528 single.go:76: INFO Error publishing events (retrying): EOF
2016/05/04 12:00:54.032559 single.go:152: INFO send fail
2016/05/04 12:00:54.032570 single.go:159: INFO backoff retry: 4s
2016/05/04 12:01:34.256385 single.go:76: INFO Error publishing events (retrying): read tcp 10.148.27.50:58898->10.144.36.30:5556: read: connection reset by peer
2016/05/04 12:01:34.256425 single.go:152: INFO send fail
2016/05/04 12:01:34.256435 single.go:159: INFO backoff retry: 8s
2016/05/04 12:02:08.162525 single.go:76: INFO Error publishing events (retrying): EOF
2016/05/04 12:02:08.162563 single.go:152: INFO send fail
2016/05/04 12:02:08.162573 single.go:159: INFO backoff retry: 16s
2016/05/04 12:02:37.764291 single.go:76: INFO Error publishing events (retrying): EOF

Filebeat on these servers never connects to Logstash (but I am 100% sure that the appropriate Logstash port is accessible from them). I tried restarting Logstash and filebeat, but this didn't help.
At the same time, on the Logstash side I see a lot of completely different errors, mostly

{:timestamp=>"2016-05-04T11:29:41.217000+0000", :message=>"Beats input: unhandled exception", :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2996:in sysread'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:462:inread_socket'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:442:in run'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats_support/connection_handler.rb:34:inaccept'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:211:in handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats_support/circuit_breaker.rb:42:inexecute'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:211:in handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:167:inrun'"], :level=>:error}

and

{:timestamp=>"2016-05-04T11:48:49.167000+0000", :message=>"Beats input: unhandled exception", :exception=>#<Zlib::BufError: buffer error>, :backtrace=>["org/jruby/ext/zlib/ZStream.java:134:in finish'", "org/jruby/ext/zlib/JZlibInflate.java:72:ininflate'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:380:in compressed_payload'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:251:infeed'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:462:in read_socket'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/lumberjack/beats/server.rb:442:inrun'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats_support/connection_handler.rb:34:in accept'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:211:inhandle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats_support/circuit_breaker.rb:42:in execute'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:211:inhandle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.8/lib/logstash/inputs/beats.rb:167:in `run'"], :level=>:error}

and others, like 'LogStash::Json::ParserError: Unexpected character' and "undefined method `<=' for nil:NilClass". All these errors come from the beats input only.

Full error stacktraces are here: http://pastebin.com/gLPHE77a

Please help me understand whether this is a known issue, or whether I should report it as a bug with my Logstash / filebeat configs and debug logs attached.

P.S. Thank you for the awesome products ) apart from occasional troubles like this one, the ELK stack is very cool :wink:

Update: on 2 servers I noticed a short period when filebeat was able to send a batch of logs to Logstash, and then it quickly failed with EOF again:

2016/05/04 12:26:22.239021 single.go:76: INFO Error publishing events (retrying): EOF
2016/05/04 12:26:22.239067 single.go:152: INFO send fail
2016/05/04 12:26:22.239078 single.go:159: INFO backoff retry: 1m0s
2016/05/04 12:27:34.776050 publish.go:104: INFO Events sent: 2000
2016/05/04 12:27:34.779450 registrar.go:163: INFO Registry file updated. 435 states written.
2016/05/04 12:27:39.822704 single.go:76: INFO Error publishing events (retrying): EOF
2016/05/04 12:27:39.822761 single.go:152: INFO send fail
2016/05/04 12:27:39.825496 single.go:159: INFO backoff retry: 1s

I am checking for network issues and file descriptor issues on the Logstash side, but so far I don't see any.

Looks like some weird problem with zlib compress/decompress. Can you try to reduce filebeat.spool_size and/or logstash.output.bulk_max_size in filebeat?
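For reference, a minimal filebeat.yml sketch of where those two settings live (the host below is a placeholder, and the values are only a starting point to experiment with, not a recommendation):

filebeat:
  # number of events buffered before they are flushed to the output as one batch
  spool_size: 200

output:
  logstash:
    hosts: ["your-logstash-host:5556"]   # placeholder host
    # maximum number of events sent to the beats input in a single request
    bulk_max_size: 200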


Thank you for your suggestion, steffens!

I've reduced filebeat.spool_size from 2000 to 200, and I haven't seen any exceptions in the Logstash logs for about 3 hours now.

But let me keep an eye on it a bit longer so I can confirm that it's fixed.

Hi there,

I re-checked the Logstash logs once again, and I still see the same errors. I tried filebeat.spool_size: 100 and even filebeat.spool_size: 10, but the only change I notice is that the number of errors goes down as I lower filebeat.spool_size. With filebeat.spool_size: 10 I see about 20 error messages per hour, while with filebeat.spool_size: 2000 I had around 20k per hour.

My log messages are multiline, but they aren't huge (the average message size is about 1 KB). The Logstash host also has enough CPU, RAM and file descriptors. My setup is pretty typical overall, so I don't think heavy Logstash load is the cause.

I will try experimenting with logstash.output.bulk_max_size and other filebeat settings.

I haven't had time to investigate this further yet, but the errors seem to be related to the total size of the events being compressed (number of events and per-event size). Maybe (due to multiline) a very big event slips through from time to time.

If you can afford it, you can disable compression by setting compression_level (doc) to 0. This might increase network bandwidth requirements by a factor of about 7 (not really recommended).

Filebeat also has some options to limit event sizes: max_bytes and multiline.max_lines (see the sketch below).
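A rough filebeat.yml sketch (1.x layout) of where these options fit, assuming a multiline prospector; the path and pattern below are placeholders, keep whatever you already use:

filebeat:
  prospectors:
    -
      paths:
        - "/var/log/app/*.log"   # placeholder path
      # bytes of a single event beyond this limit are discarded (default 10485760)
      max_bytes: 10485760
      multiline:
        pattern: '^\['           # placeholder; keep your existing pattern
        negate: true
        match: after
        # cap how many lines are merged into one multiline event (default 500)
        max_lines: 500

output:
  logstash:
    hosts: ["your-logstash-host:5556"]   # placeholder host
    # 0 disables zlib compression entirely; the default level is 3
    compression_level: 0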

After some experiments, I have an update on this.

Disabling compression helped a bit: the ZStream.java exceptions are now gone. But I still see a lot of Errno::EBADF: Bad file descriptor and LogStash::Json::ParserError: Unexpected character ('J'...) exceptions (full stack traces are here).

Regarding the total event size: I graphed the maximum message size in characters in Kibana, and found that my environment sometimes produces pretty big messages: 30-50k characters per event.

At the same time, my local testing with messages smaller than 1k does NOT show any exceptions.

@steffens, it looks like you are right: the total size of events is the cause.

Unfortunately, I can't reduce the message size for these events or cut them off in any manner. All these huge messages are TRAFFIC logs and I need them in ELK :wink:

So, what I would really like to know is the maximum throughput limits for the [filebeat + Logstash beats input] pair, so that I can set max_bytes, logstash.output.bulk_max_size and multiline.max_lines close to those limits and get rid of the exceptions; and then see whether the limits break my logs in any way.

@steffens, could you please suggest where to find these limits? Maybe filebeat has integration tests for big messages where I can see the maximum tested message size?

We've got some integration tests, but I haven't been able to reproduce any of these issues myself.

Doing some tests with very simple random traffic generation, I was able to trigger some errors in zlib decompression at a total buffer size of ~1 MB. But my errors do not match your traces, especially the new errors after disabling compression.