When should I use Kafka?


I am capturing logs with Filebeat and sending them to Logstash for filtering. The data is then sent to Elasticsearch, and the results are displayed in Kibana.
Data captured per day is about 150 MB.

Should I connect Kafka to this setup?
Can someone tell me the use cases of Kafka with the ELK stack?

Thanks in advance

No, you should not. 99% of the time Kafka is just another moving piece that adds nothing to the overall system apart from being another component that can fail. People tend to add Kafka to avoid properly provisioning their Logstash and/or Elasticsearch nodes.

Thank you :slight_smile:

Kafka is not "just a moving piece".

In our environment we use Kafka as a central point to buffer data so that it can be consumed by several consumers. For example: we collect Windows events using Winlogbeat and put them in a Kafka topic. From this topic we can create multiple Logstash pipelines to transform the data before indexing it into Elasticsearch. There is also a separate pipeline, using Kafka Connect, that sends the original event XML (the `event.original` field) to Splunk, which functions as our SIEM, and a third pipeline that filters and sends some of the events to our monitoring solution.
Some people would ask: why not use multiple Logstash outputs? If you have a small environment this may work, but if you process a large number of events it is better to split them across multiple pipelines. Also, if you use Logstash persistent queues and a queue fills up because one output endpoint is unavailable, the pipeline stops processing until that endpoint is available again.
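To illustrate the fan-out pattern above, here is a minimal sketch of two separate Logstash pipelines reading the same topic under different consumer groups. Broker addresses, topic names, and group names are made up for the example:

```conf
# pipeline 1 (e.g. elasticsearch.conf): enrich and index into Elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka01:9092"
    topics => ["winlogbeat"]
    group_id => "logstash-elasticsearch"
  }
}

# pipeline 2 (a separate pipeline file): filter a subset for monitoring
input {
  kafka {
    bootstrap_servers => "kafka01:9092"
    topics => ["winlogbeat"]
    group_id => "logstash-monitoring"
  }
}
```

Because each pipeline uses its own `group_id`, Kafka tracks offsets for each group independently, so a slow or stopped pipeline does not hold back the other consumers of the topic.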
Another reason we use Kafka is that we want events off the generating system as soon as possible, and we run one Kafka cluster per datacenter location in case of network issues.

Kafka lets you use multiple consumer groups. Each consumer group has its own metadata, so Kafka remembers which events have already been processed by each group.
Typically you keep 7 days' worth of data in Kafka (the default retention). If one of the downstream systems is unavailable at any point, for maintenance or an incident, its consumers resume from where they left off, as long as that is within the retention window.
A consumer group is usually tied to a pipeline for a specific use case, but you can run copies of the same pipeline in parallel, up to the number of partitions in the topic. By doing so you can greatly increase the total events processed per second.
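The consumer-group and partition behaviour described above can be sketched with a toy in-memory model. None of this is a real Kafka client; the class and names are purely illustrative:

```python
# Toy model of Kafka consumer groups: one topic with several partitions,
# and per-group offsets so each group consumes independently.
# Illustrative only -- not a real Kafka API.

class TinyTopic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> position of the next unread record
        self.offsets = {}

    def produce(self, key, value):
        # Records with the same key land in the same partition,
        # which is how Kafka preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)

    def consume(self, group, partition):
        pos = self.offsets.setdefault(group, {}).setdefault(partition, 0)
        batch = self.partitions[partition][pos:]
        # "Commit" the offset: this group remembers what it has read.
        self.offsets[group][partition] = len(self.partitions[partition])
        return batch

topic = TinyTopic(num_partitions=3)
for i in range(6):
    topic.produce(key=f"host{i % 3}", value=f"event-{i}")

# Two groups read the same data without interfering with each other.
siem = [e for p in range(3) for e in topic.consume("siem", p)]
monitoring = [e for p in range(3) for e in topic.consume("monitoring", p)]
print(sorted(siem) == sorted(monitoring))  # both groups see all six events
```

Because offsets are kept per group, the "siem" and "monitoring" consumers each get the full stream, and each partition could be handed to a separate consumer instance for parallelism.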

Kafka can provide "exactly-once" processing, but if you consume with Logstash you can only achieve "at-least-once".
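A common way to live with at-least-once delivery is to make indexing idempotent: derive a deterministic document ID from the event, so a redelivered event overwrites the existing document instead of creating a duplicate. A minimal sketch (the dict stands in for an index keyed by `_id`, and the event fields are made up):

```python
# At-least-once delivery means the same event can arrive twice.
# Mitigation sketch: a deterministic document ID makes retries idempotent.
import hashlib
import json

def doc_id(event: dict) -> str:
    # Stable serialization -> stable ID for identical events.
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

index = {}  # stand-in for an Elasticsearch index keyed by _id

def upsert(event: dict) -> None:
    index[doc_id(event)] = event

e = {"host": "web01", "message": "login failed", "ts": "2023-01-01T00:00:00Z"}
upsert(e)
upsert(e)  # duplicate delivery of the same event
print(len(index))  # 1 -- the retry overwrote the same document
```

The same idea applies to a real pipeline by setting the document ID explicitly in the Elasticsearch output instead of letting Elasticsearch auto-generate one.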

A rough analogy: the equivalent of a Kafka topic is an Elasticsearch index, and the equivalent of a partition is an Elasticsearch shard. However, Elasticsearch is not a streaming platform; you will never get data in and out of Elasticsearch as fast as you can with Kafka.

Kafka can provide something the Elastic Stack can't: event streaming. Depending on what you want to achieve and where you want to send your data, it can be very useful.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.