HA ELK Stack lost logs

We are currently trying to set up a HA ELK stack for our company. Our current architecture looks like this:

Logstash Forwarder -> Logstash Shipper -> Redis -> Logstash Indexer -> ES

We introduced the redis queue because during traffic spikes our logstash indexer could not handle the load, which led to dropped logs. This way we have a nice queue of logs that can be processed later on.
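
In case it helps, the broker hop is wired up roughly like this; the hostnames and the list key below are simplified placeholders rather than our literal configuration:

```
# Shipper: takes events from the lumberjack input and pushes them
# onto a redis list acting as the buffer.
output {
  redis {
    host      => "redis.internal"   # placeholder hostname
    data_type => "list"
    key       => "logstash"
  }
}

# Indexer: pops events off the same list and writes them to Elasticsearch.
input {
  redis {
    host      => "redis.internal"
    data_type => "list"
    key       => "logstash"
  }
}

output {
  elasticsearch {
    host => "es.internal"           # placeholder hostname
  }
}
```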

Today we endured an AWS outage that took down our redis and indexer instances. This caused all the Logstash Forwarder logs to pile up on our Logstash Shipper, which did not know what to do with them and thus dropped connections:

{:timestamp=>"2015-07-31T09:32:45.311000+0000", :message=>"Failed to send event to Redis", :event=>#<LogStash::Event:0x6787e62e @metadata_accessors=#<LogStash::Util::Accessors:0x2624ecef @store={}, @lut={}>, @cancelled=false, @data={"message"=>"{\"@timestamp\":\"2015-07-31T11:27:37.076911+02:00\" [...]
{:timestamp=>"2015-07-31T09:33:08.748000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:09.249000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:09.750000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:10.250000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:10.751000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-31T09:33:11.436000+0000", :message=>"CircuitBreaker::Close", :name=>"Lumberjack input", :level=>:warn}
```
When we got a redis instance back up, the shipper sent its logs into redis, queueing them up while we launched another indexer process. Once everything was working again we noticed a gap in the logs for the time redis was not available.

My guess as to what happened is that the forwarder got an ACK from the shipper for the logs it sent, but the shipper could not forward them to redis and thus dropped them.

What can I do to improve the availability? Should I move the redis instance onto the same host as the shipper?

Logstash's lumberjack input puts received messages into Logstash's internal queue synchronously, and if the output is clogged (in your case because Redis is unavailable) it'll block the pipeline and messages sent over lumberjack won't get acked. That halts the log-reading pipeline on the logstash-forwarder side too. So no, it's not likely that the shipper kept receiving logs and just dropped them on the floor.
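
For what it's worth, the 1.5.x lumberjack input that prints those CircuitBreaker warnings has a congestion_threshold setting controlling how many seconds it waits on a blocked pipeline before it starts refusing new connections, which matches your log excerpt. A minimal sketch of that input (port and certificate paths are placeholders):

```
input {
  lumberjack {
    port            => 5043                            # placeholder port
    ssl_certificate => "/etc/logstash/ssl/shipper.crt"  # placeholder path
    ssl_key         => "/etc/logstash/ssl/shipper.key"  # placeholder path
    # Seconds to wait on a blocked pipeline before the circuit breaker
    # trips and new connections are refused (default is 5).
    congestion_threshold => 5
  }
}
```

Raising that value only delays the circuit breaker, though; the unacked events stay buffered on the forwarder side either way.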

Logstash currently doesn't have at-least-once delivery semantics, but you should never lose more than 40 messages at once since that's the combined size of Logstash's two internal 20-item queues.

Now, this is the theory. There could of course be bugs that break this ideal behavior.

Did any logfiles get rotated during the outage? I don't know about logstash-forwarder's log rotation semantics, but at least with Logstash itself it's entirely possible to lose logs when Logstash is down or (presumably) when the pipeline is halted while a logfile is rotated.

@sebastianhoitz

You should read our blog post about ELK in production.

I have to say, since delivering a HA ELK stack is our business, that it's not that simple: if you want to achieve this you will have to use a different method to persist logs before indexing them. We use our own services and some fairly complicated load balancing devices to enable HA.

-- Asaf.