We have a simple setup with 4 servers:
Logstash (Log receiver)
  |
  v
Redis (Queue)
  |
  v
ElasticSearch (Storage)
  |
  v
Kibana (Viewer)
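For reference, the receiving Logstash is configured roughly like this (a simplified sketch; the port, certificate paths, and hostname are placeholders rather than our exact values):

input {
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/logstash/ssl/logstash.crt"
    ssl_key => "/etc/logstash/ssl/logstash.key"
  }
}

output {
  redis {
    host => "redis.example.com"
    data_type => "list"
    key => "logstash"
  }
}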
The receiver's log is full of messages like these:
{:timestamp=>"2016-01-21T15:17:49.408000-0800", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2016-01-21T15:17:49.602000-0800", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
There are no errors and no significant CPU/memory/network load on any of the other servers, and no significant CPU/memory/network load on the receiving server either.
Do you have some kind of Elasticsearch plugin that pulls messages from Redis, or is your diagram incomplete? Normally one would run Logstash between Redis and Elasticsearch.
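That is, a second Logstash instance that reads from Redis and writes to Elasticsearch, with a config along these lines (hostnames and key are placeholders):

input {
  redis {
    host => "redis.example.com"
    data_type => "list"
    key => "logstash"
  }
}

output {
  elasticsearch {
    hosts => ["es.example.com:9200"]
  }
}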
The error message suggests that Logstash isn't able to push events to Redis as fast as events are arriving via the lumberjack protocol. Do you have any inflow to Redis? Is there anything interesting in the logs if you crank up the logging verbosity with --verbose?
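One quick way to check the inflow is to watch the length of the Redis list that Logstash writes to (assuming the default key name "logstash"; substitute whatever key your redis output uses):

redis-cli llen logstash

If the length stays near zero, events are being drained as fast as they arrive; if it keeps growing, whatever consumes from Redis isn't keeping up.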
The Redis server is largely idle. I've been monitoring network usage, CPU, and RAM, and there is no indication that it is under any load.
I updated the Redis logging level to verbose, but I'm not 100% sure what to look for in the Redis logs. They appear normal:
[13501] 22 Jan 11:41:41 * DB saved on disk
[1332] 22 Jan 11:41:41 * Background saving terminated with success
[1332] 22 Jan 11:42:55 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:42:55 * Background saving started by pid 13502
[13502] 22 Jan 11:42:55 * DB saved on disk
[1332] 22 Jan 11:42:55 * Background saving terminated with success
[1332] 22 Jan 11:44:08 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:44:08 * Background saving started by pid 13507
[13507] 22 Jan 11:44:08 * DB saved on disk
[1332] 22 Jan 11:44:08 * Background saving terminated with success
[1332] 22 Jan 11:45:19 * 10000 changes in 60 seconds. Saving...
[1332] 22 Jan 11:45:19 * Background saving started by pid 13510
[13510] 22 Jan 11:45:19 * DB saved on disk
[1332] 22 Jan 11:45:19 * Background saving terminated with success
Before I got a chance to increase the logging level on Logstash, the errors magically disappeared.
Am I correct in saying that if we discover that Redis is actually overloaded, adding a second server is as simple as adding a second host to the redis output?
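Something along these lines, I assume (just a sketch; the hostnames and key are placeholders):

output {
  redis {
    host => ["redis1.example.com", "redis2.example.com"]
    data_type => "list"
    key => "logstash"
  }
}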
I'm not completely sure the redis output falls back to the remaining servers if one of them pushes back because of load (if that indeed is the problem here). I'd inspect the code and try it out before concluding.