Maximum throughput Logstash can handle

I have a very large, old log-gathering system based on daily cron jobs and HDFS.
This system is not suitable for real-time analysis, so I decided to introduce Logstash.

My system produces 850 GB of logs each day.
That means I have to process almost 100K lines per second on average, and the rate goes over 300K during working hours.

I built a test bench with decent hardware and tested how much log volume it can handle (Filebeat -> Logstash -> WebHDFS).
After a lot of adjustments and retries, 50K eps was the best result I could achieve in the test environment.
It reaches nearly 100K with the Generator input plugin and the File output plugin, but that is not a real-world scenario.
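For reference, here is a minimal sketch of the kind of pipeline and tuning I was experimenting with; the hostnames, paths, and numbers are placeholders, not my actual values:

```
# ---- logstash.yml (typical knobs adjusted during throughput tests) ----
# pipeline.workers: 16        # usually set to the number of CPU cores
# pipeline.batch.size: 2048   # larger batches amortize per-output overhead

# ---- pipeline config: Filebeat -> Logstash -> WebHDFS ----
input {
  beats {
    port => 5044
  }
}
output {
  webhdfs {
    host            => "namenode.example.com"   # placeholder NameNode address
    port            => 50070
    path            => "/logs/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user            => "hdfs"
    flush_size      => 5000                     # batch more lines per HDFS write
    idle_flush_time => 5
    compression     => "snappy"
  }
}
```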

Is there any use case that achieves 300K+ eps with Logstash?

As you noticed, without network latency slowing LS down, it can achieve 100K eps.

Generally, when building a solution for a high incoming volume, most folks end up with a multistage, parallelized system to spread the load.
Something like:
Filebeat(s) -> LS -> Kafka Cluster -> LS (multiple) -> WebHDFS
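
A rough sketch of what the first, "shipper" Logstash tier could look like in that layout (the broker addresses and topic name are made up for illustration):

```
# Hypothetical "shipper" tier: accept Beats traffic and buffer it in Kafka.
input {
  beats {
    port => 5044
  }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topic_id          => "raw-logs-1"    # one topic per shipper group
    codec             => json            # keep the event structure intact
    compression_type  => "snappy"        # cheaper network and broker I/O
  }
}
```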

It's possible to have each Filebeat connect to one of two or three "shipper" LS instances, with each shipper group writing to its own Kafka topic, and then have two or three "ingest" LS instances per Kafka topic pull from Kafka as a ConsumerGroup.
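
And a matching sketch for one of the "ingest" instances: every instance reading the same topic uses the same group_id, so Kafka spreads the partitions, and therefore the load, across them (again, all names and numbers are illustrative):

```
# Hypothetical "ingest" tier: pull from Kafka as a consumer group, write to HDFS.
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics            => ["raw-logs-1"]
    group_id          => "ingest-raw-logs-1"   # shared by all ingest instances on this topic
    consumer_threads  => 4                     # roughly: topic partitions / ingest instances
    codec             => json
  }
}
output {
  webhdfs {
    host            => "namenode.example.com"  # placeholder NameNode address
    port            => 50070
    path            => "/logs/%{+YYYY-MM-dd}/%{+HH}/raw-logs-1.log"
    user            => "hdfs"
    flush_size      => 5000
    idle_flush_time => 5
    compression     => "snappy"
  }
}
```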
