Hi Logstash Community,
This is Hao Chen from an e-commerce company supporting one of the largest hadoop cluster in the world. We are relying on logstash to monitor large amount of platform logs. As I could not find similar topic, so here I would like to share some of our practices in scaling logstash in distributed real-time way based on Storm, Spark Streaming or YARN. The main purpose is to enhance ourselves based on the community's great feedback, and also want to know if it's helpful enough to open source our solution or contribute to logstash community.
Problem
As all you know, logstash is a great tool to collect, enrich and transport logs with high usability and extensibility, but limited scalability. Especially while big data distributed systems like hadoop, hbase or spark are becoming more and more popular and widely adopted, how to process extremely large scale of logs in real-time is becoming a more and more serious challenge.
The limitation is mainly because logstash pipeline runs on single process and has been placed as an agent in the past. And currently officially recommended scaling solution (Deploying and Scaling Logstash | Logstash Reference [8.11] | Elastic) still seems to have some problems like:
- Too complex OPS architecture and difficult for maintenance
- Inflexible for dynamic scaling and rebalancing
- Relatively weak fault-tolerant and message guarantee
Solution
The solution by our side is to deeply customize logstash pipeline execution framework and deploy logstash pipeline directly onto real-time distributed cluster like Storm, Spark streaming or Semza (Yarn).
Thanks for the clear architecture of logstash, the log processing abstraction paradigm is highly extensible and extremely simple:
Threads[input plugin] -> (sized-queue) -> Threads[filter plugin] -> (sized-queue) -> Threads[output plugin]
So that we could seamless to convert the pipeline into distributed real-time streaming application like storm topology as following:
StormSpout[input plugin] -> (zeromq) -> StormBolt[filter plugin] -> (zeromq) -> StormBolt[output plugin]
The solution mainly proves us following benefits:
- High scalability and flexible deployment and management
- Simple and universal pipeline definition syntax for both agent-side data collection and cluster real-time processing
- Natively support logstash existing plugins and extensions
- Storm-based fault tolerate and message guarantee
Though here we are talking on storm, the pipeline execution framework is platform-independent in fact, so that we could definitely also extend to other distributed streaming data processing system like Spark streaming or Samza.
Any comments or feedbacks will be highly appreciated.
Best Regards,
Hao