I have a huge, old log-gathering system based on daily cron jobs and HDFS.
It isn't suitable for real-time analysis, so I decided to introduce Logstash.
My system produces about 850 GB of logs each day.
That works out to almost 100K lines per second on average, and it goes over 300K during working hours.
I built a test bench with decent hardware and measured how much log volume it can handle (Filebeat -> Logstash -> WebHDFS).
After a lot of tuning and retries, 50K eps was the best result I could get in that test environment.
It reaches nearly 100K eps with the Generator input plugin and the File output plugin, but that is not a real-world scenario.
Is there any real-world use case of Logstash sustaining 300K+ eps?
As you noticed, without network latency slowing LS down, it can reach 100K eps.
Generally, when building a solution for a high incoming volume, most folks end up building a multistage, parallelized system to spread the load.
Something like: Filebeat(s) -> LS -> Kafka Cluster -> LS (multiple) -> WebHDFS
It's possible to have each Filebeat connect to one of two or three "shipper" LS instances that write to a dedicated Kafka topic, and to have two or three "ingest" LS instances per topic pull from Kafka as a consumer group.
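A minimal sketch of the two pipeline configs in that chain, using the standard beats, kafka, and webhdfs plugins. The topic name (logs-raw), consumer group (logstash-ingest), broker list, and NameNode address are all placeholder assumptions:

```
# Shipper LS: receive from Filebeat, publish to a Kafka topic
input {
  beats { port => 5044 }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topic_id => "logs-raw"
    codec => json
  }
}
```

```
# Ingest LS: consume from Kafka as part of a consumer group, write to WebHDFS
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics => ["logs-raw"]
    group_id => "logstash-ingest"  # instances in the same group share the partitions
    consumer_threads => 4          # tune toward the topic's partition count
    codec => json
  }
}
output {
  webhdfs {
    host => "namenode.example"     # placeholder NameNode address
    port => 50070
    path => "/logs/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "hdfs"
  }
}
```

Because all ingest instances share one group_id, Kafka assigns each topic partition to exactly one consumer, so you can scale throughput by adding partitions and ingest LS instances in step.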