Logstash 1.5.6 doesn't accept connections from logstash-forwarder under load

Hello!

I'm having troubles with stabilizing Logstash 1.5.6 under load (30-60k tps).

I've reduced the problem to a simplest possible case (to eliminate elasticsearch as an output blocker):

  • input with lumberjack and multiline codec
  • no filters
  • output to sysout (redirected to /dev/null)

I've started 6 instances of logstash on ports 5000, 5002, 5004, 5006, 5008, 5010 (as the single input thread is a bottleneck) and connected the logstash-forwarders (11 servers, each running 2 forwarders).

The system works fine for some time, each server sending around 3000 events every second:
2016/02/03 05:58:35.175857 Registrar: processing 1024 events

After 30-60 minutes first i/o timeouts start to appear and it gets worse every minute.

Then it ends with something like this

2016/02/03 05:57:53.364390 Connecting to [10.141.51.15]:5002 (10.141.51.15) 
2016/02/03 05:57:53.724930 Connected to 10.141.51.15
2016/02/03 05:57:54.725167 Read error looking for ack: read tcp 10.141.51.15:5002: i/o timeout
2016/02/03 05:57:54.725269 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:57:54.725718 Connecting to [10.141.51.15]:5010 (10.141.51.15) 
2016/02/03 05:57:55.727050 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5010: i/o timeout
2016/02/03 05:57:56.727424 Connecting to [10.141.51.15]:5000 (10.141.51.15) 
2016/02/03 05:57:57.729849 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5000: i/o timeout
2016/02/03 05:57:58.730216 Connecting to [10.141.51.15]:5008 (10.141.51.15) 
2016/02/03 05:57:59.731860 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5008: i/o timeout
2016/02/03 05:58:00.732209 Connecting to [10.141.51.15]:5000 (10.141.51.15) 
2016/02/03 05:58:01.733470 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5000: i/o timeout
2016/02/03 05:58:02.733805 Connecting to [10.141.51.15]:5010 (10.141.51.15) 
2016/02/03 05:58:02.812858 Connected to 10.141.51.15
2016/02/03 05:58:02.980291 Registrar: processing 1024 events
2016/02/03 05:58:03.987605 Read error looking for ack: read tcp 10.141.51.15:5010: i/o timeout
2016/02/03 05:58:03.987694 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:58:03.988301 Connecting to [10.141.51.15]:5006 (10.141.51.15) 
2016/02/03 05:58:04.455730 Connected to 10.141.51.15
2016/02/03 05:58:04.651590 Registrar: processing 1024 events
2016/02/03 05:58:04.851652 Registrar: processing 1024 events
2016/02/03 05:58:05.114067 Registrar: processing 1024 events
2016/02/03 05:58:06.124210 Read error looking for ack: read tcp 10.141.51.15:5006: i/o timeout
2016/02/03 05:58:06.124316 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:58:06.124859 Connecting to [10.141.51.15]:5000 (10.141.51.15) 
2016/02/03 05:58:06.193736 Connected to 10.141.51.15
2016/02/03 05:58:07.193937 Read error looking for ack: read tcp 10.141.51.15:5000: i/o timeout

the input section of logstash:

input {
  lumberjack {
    port => 5000
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder_15.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder_15.key"
      codec => multiline {
        # Grok pattern names are valid! :)
        pattern => "^\"?\d{4,8}"
        negate => true
        what => previous
      }
  }
}

And the samples from logstash log:

{:timestamp=>"2016-02-03T05:05:47.730000+0530", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2016-02-03T05:05:47.724000+0530", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rej
ecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-02-03T05:05:47.734000+0530", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2016-02-03T05:05:49.940000+0530", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2016-02-03T05:05:49.946000+0530", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2016-02-03T05:06:20.654000+0530", :message=>"CircuitBreaker::Close", :name=>"Lumberjack input", :level=>:warn}

CPU is almost unused (10-20% used), 40GB of memory available.
What might be an issue?
Lumberjack protocol? Multiline codec?

I'll try to run it without multiline... I'll update when I wake up :slight_smile:

Could be thread bound or network bound. Can you see cpu utilization per
core? BTW how many cores? Otherwise it sounds like the network is being too
chatty with too many connections.

Update: it crashed without multiline codec too :confused:

There are 16 virtual cores on that server. 64 GB of RAM.

Cpu utilisation per core (after a crash of communication, but logstashes are still up)

12:42:24     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
12:42:25     all   29,29    0,00    0,13    0,00    0,00    0,00    0,00    0,00   70,59
12:42:25       0   77,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   23,00
12:42:25       1   49,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   51,00
12:42:25       2   17,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   83,00
12:42:25       3   61,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   39,00
12:42:25       4   18,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   82,00
12:42:25       5   39,39    0,00    0,00    0,00    0,00    0,00    0,00    0,00   60,61
12:42:25       6   16,16    0,00    0,00    0,00    0,00    0,00    0,00    0,00   83,84
12:42:25       7   46,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   54,00
12:42:25       8   30,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   70,00
12:42:25       9   44,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   56,00
12:42:25      10    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00
12:42:25      11   16,00    0,00    1,00    0,00    0,00    0,00    0,00    0,00   83,00
12:42:25      12    4,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   96,00
12:42:25      13   17,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00   83,00
12:42:25      14   17,17    0,00    0,00    0,00    0,00    0,00    0,00    0,00   82,83
12:42:25      15   17,17    0,00    0,00    0,00    0,00    0,00    0,00    0,00   82,83

The network... I guess if the network was an issue then we would have timeouts on logstash-forwarder, but I wouldn't expect them on logstash.
Is my assumption right?

Ok, one more update.
My suspect was the SSL handshake process, so I was looking for a way to get rid of it.
As filebeats do not require the certs, keys and similar stuff + the 1.1 version is able to handle the multilines I've decided to give it a try.

I'm running 18 filebeats, thay are all sending requests to a single logstash (looks like it's not random anymore in filebeats). The load is much lower than it was yesterday.
It has been running for 3 hours and... no crash.

Is it possible that the handshake phase was blowing the logstash up after some time? Or it's just a coincidence and if I have a bigger load I'd experience the same issues as before?