We are currently trying to set up an HA ELK stack for our company. Our current architecture looks like this:
Logstash Forwarder -> Logstash Shipper -> Redis -> Logstash Indexer -> ES
We introduced the Redis queue because during traffic spikes our Logstash indexer could not handle the load, which led to dropped logs. This way we have a buffer of logs that can be processed later on.
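For context, the shipper is essentially a lumberjack input feeding a redis output, along the lines of the sketch below (hostnames, port, certificate paths and the Redis key are simplified placeholders, not our actual values):

```
# Shipper sketch: accept events from Logstash Forwarder (lumberjack)
# and push them onto a Redis list. All values here are placeholders.
input {
  lumberjack {
    port            => 5043
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key         => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

output {
  redis {
    host      => "redis.internal"
    data_type => "list"
    key       => "logstash"
  }
}
```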
Today we endured an AWS outage that took down our Redis and indexer instances. This caused all the Logstash Forwarder logs to pile up on our Logstash Shipper, which did not know what to do with them and thus dropped connections:

```
{:timestamp=>"2015-07-31T09:32:45.311000+0000", :message=>"Failed to send event to Redis", :event=>#<LogStash::Event:0x6787e62e @metadata_accessors=#<LogStash::Util::Accessors:0x2624ecef @store={}, @lut={}>, @cancelled=false, @data={"message"=>"{\"@timestamp\":\"2015-07-31T11:27:37.076911+02:00\" [...]
{:timestamp=>"2015-07-31T09:33:08.748000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:09.249000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:09.750000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:10.250000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:10.751000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:11.436000+0000", :message=>"CircuitBreaker::Close", :name=>"Lumberjack input", :level=>:warn}
```
When we got a Redis instance back up, the shipper sent its logs into Redis, queueing them up while we launched another indexer process. Once everything was working again, we noticed a gap in the logs for the time Redis was not available.
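For completeness, the indexer side is just a redis input feeding Elasticsearch, roughly like this sketch (again, hostnames and the key are placeholders, and output option names can differ between Logstash versions):

```
# Indexer sketch: pull events off the Redis list and index them into
# Elasticsearch. All values here are placeholders.
input {
  redis {
    host      => "redis.internal"
    data_type => "list"
    key       => "logstash"
  }
}

output {
  elasticsearch {
    host => "es.internal"  # "hosts" on newer Logstash versions
  }
}
```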
My guess as to what happened is that the forwarder got an ACK from the shipper for the logs it sent, but the shipper could not forward them to Redis and thus dropped them.
What can I do to improve availability? Should I move the Redis instance onto the same host as the shipper?