Queueing LogEvents in Redis or similar needed?

Hi,

I sometimes come across Elastic architectures where Logstash, which parses the logs, sits behind a Redis or Kafka instance.
What are the advantages of such a setup?

Currently we are shipping our logs via Filebeat (running on the source system), which sends them to our processing Logstash instance.

I have also seen architectures with two chained LS instances: one collects the data first and then forwards it to a second LS, which processes it and sends it to ES.
What are the advantages of such an architecture?

Thanks and Regards, Andreas

Hi Andreas,

Message queue systems (like the ones you mentioned) are mostly used for resilience, decoupling of services, and preventing back-pressure from reaching your Logstash collectors.

Without them, if your processing Logstash nodes fail for whatever reason, your data storage can't keep up with the ingest rate, or any other abnormal situation occurs, you will effectively be dropping logs most of the time due to cascading effects.

Another minor convenience: if you ever want another component to consume the same logs, you just point that component at the appropriate Kafka/Redis instance and you're good to go, without having to wire Logstash and that component together manually.

So they are not mandatory for the most part, but it's very good practice to design your infrastructure layout with them in mind.
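
To make that concrete, here is a minimal sketch of the decoupled layout, assuming a Kafka broker reachable at `kafka:9092` and a topic named `logs` (both are placeholder names; the same idea works with the redis output/input):

```
# Collector pipeline: accept Beats traffic and hand it straight to Kafka.
# No heavy parsing happens here, so back-pressure stays away from the edge.
input {
  beats {
    port => 5044
  }
}
output {
  kafka {
    bootstrap_servers => "kafka:9092"   # placeholder broker address
    topic_id          => "logs"         # placeholder topic name
    codec             => json
  }
}
```

Any other consumer (another Logstash pipeline, a stream processor, an archiver) can then read the same topic without the collector having to know about it.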

As for chained Logstash instances, I believe this is mostly because you'd want your collectors to be as lightweight as possible. You need the data inside your infrastructure, where you can then process it at your leisure.

Once you scale out and have many sources and Logstash instances, direct connections can cause imbalances and make operations more difficult. It is often at this point that a message queue is introduced, both to allow buffering and to decouple the collection layer from the processing layer. With this setup a pool of Logstash instances can pull as quickly as possible from a central queue, which distributes the load quite well.

Communication between two Logstash instances both encrypts and compresses the traffic, so I have seen this used when different teams are responsible for different parts of a processing pipeline, as well as when data needs to be shipped efficiently between geographically distributed data centres.
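
For the chained setup, the shipping instance is typically just an input plus a forwarding output. A rough sketch, assuming the lumberjack output on the shipper talking to a beats input on the ingestion node; the host name and certificate path are placeholders:

```
# Shipping instance: lightweight, no filters, forwards over the lumberjack
# protocol (TLS-encrypted and, as noted above, compressed on the wire).
input {
  beats {
    port => 5044
  }
}
output {
  lumberjack {
    hosts           => ["ingest-ls.example.org"]        # placeholder ingest node
    port            => 5000
    ssl_certificate => "/etc/logstash/certs/ingest.crt" # placeholder path
  }
}
```

On the ingestion instance the matching end would be a beats input with `ssl => true` plus the corresponding certificate and key, followed by the heavy filter section and the elasticsearch output.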

The architecture that puts a broker in between two LS instances is about buffering and parallelism. The informal names for the two LS instances are "shipping" and "ingestion".

Many people have very complex configs to transform and enhance events before sending them to ES. These LS instances usually run more slowly than one with a simple config would.

The shipping side usually runs quite fast, so because there is a difference between the incoming and outgoing rates, some kind of resilient (clustered, persisted) buffer between shipping and ingestion is needed.

Also, because Kafka allows multiple consumers (the LS kafka input) to consume a topic via a consumer group, it is possible (and good) to have 2, 3, or 4 LS instances pull documents from the same topic without duplication: parallelism.
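
On the ingestion side that parallelism is mostly a matter of pointing every instance at the same consumer group. A sketch with placeholder broker, topic, and group names:

```
# Ingestion pipeline: run this same config on 2, 3, or 4 Logstash instances.
# The shared group_id makes Kafka split the topic's partitions among them,
# so each event is consumed by exactly one instance - no duplication.
input {
  kafka {
    bootstrap_servers => "kafka:9092"       # placeholder broker address
    topics            => ["logs"]           # placeholder topic name
    group_id          => "logstash-ingest"  # shared consumer group
    codec             => json
  }
}
filter {
  # the complex transform/enhance work lives here
}
output {
  elasticsearch {
    hosts => ["es:9200"]                    # placeholder ES address
  }
}
```

Keep in mind that parallelism is capped by the number of partitions on the topic, so create the topic with at least as many partitions as consuming LS instances.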

Thanks for the reply.

I can understand that if Logstash becomes too slow / unavailable, it affects the server where Filebeat runs, because retrying to ship the logs to Logstash eats performance.

But I don't understand the following:

Why would anything be dropped? As I understand it, Filebeat takes care of that: it remembers which log lines were shipped last, and if a line could not be delivered it will be resent by Filebeat.

There are other architectures where dropping data is possible.
Broadly speaking there are two kinds of sources (illustrated in the sketch after this list):

  1. Persistent and replay-able - File (Filebeat), Kafka, RabbitMQ, SQL databases, S3, etc.
  2. Transient - networked machine-to-LS connections without spooling or buffering in between. This includes devices like firewalls and ephemeral machines like Docker instances, where the design does not preserve the log files after the instance has been destroyed.
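
As an illustration of the difference, a pipeline mixing the two kinds of sources might look like this (the ports are placeholders):

```
input {
  # Replay-able: Filebeat keeps a registry of what it has shipped and
  # resends anything that was not acknowledged.
  beats {
    port => 5044
  }
  # Transient: a firewall pushing syslog over UDP has no spool on its side;
  # anything sent while Logstash is down or back-pressured is simply gone.
  udp {
    port => 5514
    type => "syslog"
  }
}
```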

However, bugs have to be taken into account. If, say, a Kafka input has pulled some data and the offset is updated, but a bug causes LS to lose the event before it reaches its destination (ES or the dead letter queue), then that data cannot be replayed without human intervention.

These bugs, more often than not, manifest as a direct result of high volumes of data inflow combined with LS/ES configurations that cannot keep up with that volume.

Not every installation is designed for high-volume inflow from the get-go. Some people start with a performant configuration that works for a test/trial volume and then slowly connect more and more parts of the business operation to the ingest infrastructure - until :boom:. In other cases the inflow volume exhibits peaks of high volume that occur daily or as a result of a special event.

To be honest, Logstash does not have a great story around pre-emptive alerting when such a limit is approaching (it's coming); people have to roll their own at present. There is no Scotty to warn "She cannae take any more, Captain!".

We, Elastic and the Logstash team, are putting enormous effort into turning Logstash into a turn-key, high-volume-capable product, but the code surface area (and the configuration permutations) are large.


Thanks a lot.
Now I think I have a quite good overview of the topic.

I will keep that in mind if we encounter problems, build up new environments, or make noticeable changes in volume.

