Using RabbitMQ as a broker between Beats and Logstash

Hello, I’ve designed an architecture for a SIEM using ELK. In this architecture, I’ve placed a RabbitMQ broker between Beats and Logstash. Is this a good choice? My final target is to collect logs from at least 1000 Beats and forward them through RabbitMQ to be processed by Logstash. The broker is there to avoid a bottleneck between Beats and Logstash when the log volume gets too high. I chose RabbitMQ instead of Kafka because I only need to forward logs from Beats to Logstash and I wanted to avoid the complexity of a Kafka deployment. So my question is: is RabbitMQ a good choice in this scenario? What are your recommendations?
Thank you

Hi @NasrJBr,

You can use Logstash directly as Filebeat's output. There is no need for RabbitMQ unless you have a specific requirement. Logstash can also queue all the events persistently (with its persistent queue enabled).
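A minimal sketch of that setup could look like this (hostnames, ports and paths are placeholders, not anything from the original post):

```yaml
# filebeat.yml — send events straight to Logstash, no broker in between
filebeat.inputs:
  - type: filestream
    id: siem-logs            # placeholder input id
    paths:
      - /var/log/*.log       # placeholder path

output.logstash:
  # Logstash must run a `beats` input listening on this port (5044 is the usual default)
  hosts: ["logstash.example.internal:5044"]
```

On the Logstash side you would pair this with a `beats { port => 5044 }` input in the pipeline.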

1 Like

As far as I know, Beats are not able to output directly to RabbitMQ, so I do not think this will work. Either deploy Kafka or send the data directly to Logstash. Logstash is able to queue events on disk, but as it is local disk it does not give you the resiliency that Kafka would offer.

I would say that it is not, mainly because Beats cannot output data directly to RabbitMQ, so you would need another piece to manage in your infrastructure.

Kafka is a far better choice, as you can output from Beats directly into a Kafka topic.

I've been using Kafka as a message broker in ELK deployments for so many years that every time I need to spin up a new Elastic Cluster I consider Kafka as an essential part of it.
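For reference, the Beats side of that is just a different output section — a sketch only, with placeholder broker addresses and topic name:

```yaml
# filebeat.yml — ship events to a Kafka topic instead of straight to Logstash
output.kafka:
  hosts: ["kafka1.example.internal:9092", "kafka2.example.internal:9092", "kafka3.example.internal:9092"]
  topic: "filebeat-siem"           # placeholder topic name
  partition.round_robin:
    reachable_only: false          # distribute across all partitions, even ones currently unreachable
  required_acks: 1                 # wait for the partition leader to acknowledge each batch
  compression: gzip
```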

2 Likes

Hi @ashishtiwari1993, won’t there be a risk of data loss from Beats if there’s a connection or performance issue?

Hi @Christian_Dahlqvist, I believe RabbitMQ is mentioned in the official documentation.

Hi @leandrojmp, okay, so Kafka is better than RabbitMQ for handling these types of problems. Initially, I considered using direct ingestion, but I took into consideration potential bottlenecks and data loss in case of problems, which is why I added a broker to my architecture.

The link I provided shows the supported outputs, and RabbitMQ is not on it. There is however a module for collecting logs (possibly also metrics) from RabbitMQ, but that is very different.

Ah, okay. So adding RabbitMQ would require a different approach. From your perspective, you recommend using Kafka instead of RabbitMQ, given that my final objective is to eliminate bottlenecks and data loss.

Yes, I would recommend using Kafka as I have seen it used successfully in deployments with very high throughput numbers. I am not sure how RabbitMQ performance compares, but I believe it is significantly slower.

1 Like

OK, thank you. One more thing: is there any other alternative to use instead of a broker to handle the two problems (bottlenecks and data loss)?

No, I think using Kafka is the best option. It is a very common pattern and therefore easy to get help with.

1 Like

OK, thank you for your help @Christian_Dahlqvist.

Filebeat guarantees that events will be delivered to the configured output at least once and with no data loss. Filebeat is able to achieve this behavior because it stores the delivery state of each event in the registry file.

Once data is delivered to Logstash, Logstash will write it to disk (when the persistent queue is enabled), so there is no data loss.

In case of high traffic, you can add more Logstash servers behind a load balancer.
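As a rough sketch of both ideas (instance names are placeholders; note that the persistent queue is off by default, so it has to be enabled explicitly):

```yaml
# logstash.yml — enable the persistent queue so in-flight events survive a restart
queue.type: persisted
queue.max_bytes: 4gb        # placeholder size, tune to your disk and traffic
```

```yaml
# filebeat.yml — let Filebeat itself spread events over several Logstash instances
output.logstash:
  hosts: ["logstash1.example.internal:5044", "logstash2.example.internal:5044"]
  loadbalance: true
```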

Hi @ashishtiwari1993, instead of using a broker, can I use a load balancer between Beats and Logstash? Do all Beats support data loss prevention like Filebeat? Also, can you recommend some load balancers for this SIEM use case?

When you write data to Kafka, the data is distributed across the cluster and the loss of a single node does not lead to data loss. Logstash only persists data to the local disk, so losing a Logstash node would likely result in data loss.

Another benefit of using Kafka is that it helps distribute the processing load evenly across the processes pulling from it. With a large number of Logstash instances you do run the risk of them becoming very unevenly loaded.

Using Kafka is therefore in my opinion the superior option.

1 Like

Agreed!!! For a distributed architecture, Kafka is the best. Just curious, can distributing the same events lead to data duplication (if we're pushing to Elasticsearch, the same cluster)?

It depends on how they are distributed. With Kafka this is not an issue, as you can use the same group id for each Logstash instance, and then the events will not be duplicated.

A good approach is to make the number of partitions on the Kafka topic equal to the number of Logstash nodes; this way each Logstash node will consume from one partition and the events will be evenly distributed.
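A minimal sketch of the consumer side, assuming a placeholder topic named `filebeat-siem` and placeholder broker addresses — run the same pipeline on every Logstash node so they share the consumer group:

```
input {
  kafka {
    bootstrap_servers => "kafka1.example.internal:9092,kafka2.example.internal:9092"
    topics => ["filebeat-siem"]
    group_id => "logstash-siem"   # same group id on every node, so each event is consumed only once
    codec => "json"               # Filebeat's Kafka output ships events as JSON
  }
}
```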

3 Likes

When we talk about data loss prevention using a load balancer, it can be inefficient due to overloads from the Beats agents. Therefore, the best solution is to use a broker. My question is: what is the best configuration to apply for a Kafka broker with Logstash in the context of a SIEM?

Why would this be different? You are still shipping data and want to do so in a performant and reliable fashion.