Lumberjack input: the pipeline is blocked, temporary refusing new connection


#1

We have a simple setup with 4 servers:
Logstash (Log receiver)
|
/
Redis (Queue)
|
/
ElasticSearch (Storage)
|
/
Kibana (Viewer)

Receiver log is full of similar messages

{:timestamp=>"2016-01-21T15:17:49.408000-0800", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2016-01-21T15:17:49.602000-0800", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}

There are no errors or significant cpu/memory/network load on any of the other servers. There is also no significant cpu/memory/network load on the receiving server.

How do we identify the bottleneck?
Thanks!


(Magnus Bäck) #2

Do you have some kind of plugin to Elasticsearch to pull messages from Redis? Or is your graph incomplete? Normally one would use Logstash between Redis and Elasticsearch.


#3

Second Logstash server is running on the same instance as ES server. It pulls from Redis and outputs into ES.


(Magnus Bäck) #4

The error message suggests that Logstash isn't able to push events to Redis as fast as events are arriving via the lumberjack protocol. Do you have any inflow to Redis? Is there anything interesting in the logs if you crank up the logging verbosity with --verbose?


#5

Redis server is largely idle. I've been monitoring network usage, CPU and RAM and there are no indication that it is under any load.

I updated Redis logging level to verbose, but I'm not a 100% what to look for in Redis logs. They appear normal:

[13501] 22 Jan 11:41:41 * DB saved on disk
[1332] 22 Jan 11:41:41 * Background saving terminated with success
[1332] 22 Jan 11:42:55 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:42:55 * Background saving started by pid 13502
[13502] 22 Jan 11:42:55 * DB saved on disk
[1332] 22 Jan 11:42:55 * Background saving terminated with success
[1332] 22 Jan 11:44:08 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:44:08 * Background saving started by pid 13507
[13507] 22 Jan 11:44:08 * DB saved on disk
[1332] 22 Jan 11:44:08 * Background saving terminated with success
[1332] 22 Jan 11:45:19 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:45:19 * Background saving started by pid 13510
[13510] 22 Jan 11:45:19 * DB saved on disk
[1332] 22 Jan 11:45:19 * Background saving terminated with success


(Magnus Bäck) #6

I was referring to the Logstash log. I'd be surprised if there wasn't some kind of indication of why it's slow or blocked.


#7

Before I got a chance to increase logging level on Logstash, the errors magically disappeared.

Am I correct in saying that if we discover that Redis is actually overloaded, adding a second server is as simple as adding a second host to the output like so:

output {
  redis {
    data_type => "list"
    key => "logstash:cache:dev"
    host => [ "redis1", "redis2" ]
  }
}

and then collecting from two servers like so:

input {
  redis {
    host => "redis1"
    data_type => "list"
    key => "logstash:cache:dev"
  }
}
input {
  redis {
    host => "redis2"
    data_type => "list"
    key => "logstash:cache:dev"
  }
}

Thanks!


(Magnus Bäck) #8

I'm not completely sure the redis output falls back to the remaining servers if one of them pushes back because of load (if that indeed is the problem here). I'd inspect the code and try it out before concluding.


(system) #9