Assistance with Logstash Sizing and Kafka Integration

I am new to Elastic and would like to understand the sizing requirements for Logstash based on the following:
• Data to process: 7TB over 10 hours
• Average event size: 70% of events are 1KB, and 30% are 500 bytes

How can I determine the number of Logstash servers needed, considering a persistent disk-based queue? Each Logstash instance will have 16 vCPUs and 32GB of memory. Additionally, how many events per second (EPS) can each Logstash instance handle?
Furthermore, how can integrating Kafka optimize the Logstash deployment? Specifically, can Kafka reduce the number of Logstash instances required? If so, how many instances can be reduced before and after Kafka integration?
Lastly, how does Kafka compare to F5 LTM in terms of handling high throughput?

Thank you for your assistance.

That really depends on what the pipeline is doing. Enrichment calls using DNS / elasticsearch / geoip / JDBC / memcache can add tens of milliseconds to the time it takes to process a single event.
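
For example, just one or two of those enrichment filters can dominate per-event processing time. A minimal sketch (field names are placeholders, not from the original question):

```
filter {
  # GeoIP enrichment: local database lookup, cheap but not free
  geoip {
    source => "client_ip"          # placeholder field name
  }
  # DNS enrichment: a network round trip per uncached lookup,
  # easily tens of milliseconds per event
  dns {
    resolve => ["hostname"]        # placeholder field name
    action  => "replace"
  }
}
```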

That said, if you want to move 7 TB in 10 hours, that's 700 GB per hour. If the average event is 1 KB, that is 700 million events per hour, or about 200,000 events per second. I doubt anyone on the planet is moving that kind of volume through logstash. There are almost certainly better architectures for ultra-high-throughput ETL.

Hello and welcome,

I think you are mixing up some things.

F5 LTM is a load balancer: it distributes requests between multiple destinations. Kafka is an event store and stream-processing tool that can also be used as a message broker or buffer. They have nothing to do with each other.

You would use a load balancer when you need to distribute events across multiple servers, either for high availability or because one server alone cannot keep up with all the events. You would use Kafka when you need a buffer of events, to distribute the processing of those events, and to deal with event spikes.

It is pretty common to use both Load Balancers and Kafka in combination with Logstash.

Also, adding Kafka does not reduce the number of Logstash instances; on the contrary, you normally add Kafka when you need more Logstash instances to process your data.

As an example, I have something closer to 50k events/s and I use load balancers, multiple Logstash instances, and Kafka. Some Logstash instances act as producers for Kafka: they receive the data and send it to Kafka, with no parsing done. Other Logstash instances act as consumers for Kafka: they read from the Kafka topics, parse the data, and send it to Elasticsearch.
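
As a rough sketch of that producer/consumer split (ports, broker addresses, topic names, and field names below are made up for illustration):

```
# Producer instances: receive events and hand them to Kafka, no parsing
input {
  beats { port => 5044 }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"   # made-up brokers
    topic_id          => "raw-logs"                              # made-up topic
    codec             => json
  }
}
```

```
# Consumer instances (separate hosts): read from Kafka, parse, send to Elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics            => ["raw-logs"]
    group_id          => "logstash-consumers"
    codec             => json
  }
}
filter {
  # parsing/transformation happens only here, on the consumer side
  date { match => ["timestamp", "ISO8601"] }   # placeholder field
}
output {
  elasticsearch {
    hosts => ["https://es1:9200"]   # made-up host
  }
}
```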

To move something closer to 200k events/s, as Badger has calculated, your main bottleneck will probably be your output. You can do that with Logstash, but it would not be simple; multiple instances, load balancers, and maybe Kafka would be required.
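
On the output side, one thing that helps is listing several Elasticsearch nodes in the output so each Logstash instance spreads its bulk requests across the cluster. A small sketch (hostnames are placeholders):

```
output {
  elasticsearch {
    # Logstash distributes bulk requests across the listed hosts
    hosts => ["https://es1:9200", "https://es2:9200", "https://es3:9200"]
  }
}
```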

I think the best way to find out what kind of infrastructure you will need is by testing it. Also, Logstash is more CPU bound; it does not make much sense to use more than 8 GB of heap for it, so 16 GB machines would be fine.

Thank you all for the valuable feedback.

I would like to get more clarity on the following points:

  • Data to process: 7TB over 10 hours
  • Average event size: 70% of events are 1KB from systems like Windows and Linux with Elastic agents installed, and 30% are 500 bytes from network devices, load balancers, IPS, and firewalls. Firewalls generate the most events, ranging from 12K to 15K EPS.

Can I conservatively assume that a well-optimized Logstash instance with 16 vCPUs and 32GB of memory can handle around 10,000 EPS with an event size of 1KB? If so, here is my calculation:

Number of Logstash servers: 199,680 EPS / 10,000 EPS ≈ 20 instances

Architecture:

Source (Systems) ---\
                     +---> Kafka x 3 ---> Logstash x 20
Source (Network) ---/


Lastly, if a physical server is running ESXi with hyperthreading enabled, can I consider each thread as 1 vCPU as well?

Thank you.

Best regards,
William

No, you need to test. The event rate of a Logstash instance also depends on your output: if you have a total of 200k e/s but your output can only deal with 100k e/s, your Logstash will adjust to that, as the output will tell Logstash to back off, and this may or may not lead to delays.

It will also depend on your pipeline filters and on what parsing and transformation you are planning to do.

The only way to find the best infrastructure for a use case is by testing.

Also, 32 GB machines for Logstash are in most cases not needed; you should start smaller and increase the machine size if needed. In most scenarios you should not use more than 8 GB of heap memory, so 16 GB machines would be OK.
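
If it helps, the heap size is set in Logstash's config/jvm.options; something like the following caps it at the 8 GB mentioned above (treat the value as a starting point, not a rule):

```
# config/jvm.options
-Xms8g
-Xmx8g
```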

Another thing is that not every source can send data directly to Kafka, so you may need something between your sources and Kafka; this can also be smaller Logstash instances.

When using Kafka you should also match the number of partitions in your topics to the number of Logstash instances; for example, if you have 2 Logstash instances, your topics should have 2 partitions. This helps balance the events evenly.
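
As a sketch with 2 Logstash instances consuming a 2-partition topic (topic and broker names are made up), both instances share the same group_id so Kafka assigns one partition to each:

```
# Topic created with 2 partitions, e.g.:
#   kafka-topics.sh --create --topic raw-logs --partitions 2 \
#     --replication-factor 1 --bootstrap-server kafka1:9092

# Identical kafka input on both Logstash instances
input {
  kafka {
    bootstrap_servers => "kafka1:9092"          # made-up broker
    topics            => ["raw-logs"]           # made-up topic
    group_id          => "logstash-consumers"   # shared consumer group
    consumer_threads  => 1                      # one thread per assigned partition
  }
}
```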

To size your Logstash deployment:

  1. EPS Calculation: For 7TB of data over 10 hours, assuming 1KB events, the EPS is about 194,444 (rough estimate). Each Logstash instance typically handles 5,000 to 15,000 EPS, so you'll need around 13-39 instances (assuming 16 vCPUs, 32GB RAM).
  2. Kafka Integration: Kafka acts as a buffer to smooth out data spikes and allows Logstash to process data at a steady rate, which can reduce the number of Logstash instances required by offloading event handling.
  3. Instance Reduction: With Kafka, you could potentially reduce the Logstash instances by 30-50% due to better data flow management.
  4. Kafka vs. F5 LTM: Kafka is optimized for high-throughput data streaming, making it more suitable for large-scale log handling, while F5 LTM is better for load balancing rather than data queuing.