Hello!
I'm having troubles with stabilizing Logstash 1.5.6 under load (30-60k tps).
I've reduced the problem to a simplest possible case (to eliminate elasticsearch as an output blocker):
- input with lumberjack and multiline codec
- no filters
- output to sysout (redirected to /dev/null)
I've started 6 instances of logstash on ports 5000, 5002, 5004, 5006, 5008, 5010 (as the single input thread is a bottleneck) and connected the logstash-forwarders (11 servers, each running 2 forwarders).
The system works fine for some time, each server sending around 3000 events every second:
2016/02/03 05:58:35.175857 Registrar: processing 1024 events
After 30-60 minutes first i/o timeouts start to appear and it gets worse every minute.
Then it ends with something like this
2016/02/03 05:57:53.364390 Connecting to [10.141.51.15]:5002 (10.141.51.15)
2016/02/03 05:57:53.724930 Connected to 10.141.51.15
2016/02/03 05:57:54.725167 Read error looking for ack: read tcp 10.141.51.15:5002: i/o timeout
2016/02/03 05:57:54.725269 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:57:54.725718 Connecting to [10.141.51.15]:5010 (10.141.51.15)
2016/02/03 05:57:55.727050 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5010: i/o timeout
2016/02/03 05:57:56.727424 Connecting to [10.141.51.15]:5000 (10.141.51.15)
2016/02/03 05:57:57.729849 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5000: i/o timeout
2016/02/03 05:57:58.730216 Connecting to [10.141.51.15]:5008 (10.141.51.15)
2016/02/03 05:57:59.731860 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5008: i/o timeout
2016/02/03 05:58:00.732209 Connecting to [10.141.51.15]:5000 (10.141.51.15)
2016/02/03 05:58:01.733470 Failed to tls handshake with 10.141.51.15 read tcp 10.141.51.15:5000: i/o timeout
2016/02/03 05:58:02.733805 Connecting to [10.141.51.15]:5010 (10.141.51.15)
2016/02/03 05:58:02.812858 Connected to 10.141.51.15
2016/02/03 05:58:02.980291 Registrar: processing 1024 events
2016/02/03 05:58:03.987605 Read error looking for ack: read tcp 10.141.51.15:5010: i/o timeout
2016/02/03 05:58:03.987694 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:58:03.988301 Connecting to [10.141.51.15]:5006 (10.141.51.15)
2016/02/03 05:58:04.455730 Connected to 10.141.51.15
2016/02/03 05:58:04.651590 Registrar: processing 1024 events
2016/02/03 05:58:04.851652 Registrar: processing 1024 events
2016/02/03 05:58:05.114067 Registrar: processing 1024 events
2016/02/03 05:58:06.124210 Read error looking for ack: read tcp 10.141.51.15:5006: i/o timeout
2016/02/03 05:58:06.124316 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder_15.crt
2016/02/03 05:58:06.124859 Connecting to [10.141.51.15]:5000 (10.141.51.15)
2016/02/03 05:58:06.193736 Connected to 10.141.51.15
2016/02/03 05:58:07.193937 Read error looking for ack: read tcp 10.141.51.15:5000: i/o timeout
the input section of logstash:
input {
lumberjack {
port => 5000
ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder_15.crt"
ssl_key => "/etc/pki/tls/private/logstash-forwarder_15.key"
codec => multiline {
# Grok pattern names are valid! :)
pattern => "^\"?\d{4,8}"
negate => true
what => previous
}
}
}
And the samples from logstash log:
{:timestamp=>"2016-02-03T05:05:47.730000+0530", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2016-02-03T05:05:47.724000+0530", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rej
ecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-02-03T05:05:47.734000+0530", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2016-02-03T05:05:49.940000+0530", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2016-02-03T05:05:49.946000+0530", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2016-02-03T05:06:20.654000+0530", :message=>"CircuitBreaker::Close", :name=>"Lumberjack input", :level=>:warn}
CPU is almost unused (10-20% used), 40GB of memory available.
What might be an issue?
Lumberjack protocol? Multiline codec?
I'll try to run it without multiline... I'll update when I wake up