Scale Logstash in Distributed Real-Time Way based on Storm


(Hao Chen) #1

Hi Logstash Community,

This is Hao Chen from an e-commerce company supporting one of the largest hadoop cluster in the world. We are relying on logstash to monitor large amount of platform logs. As I could not find similar topic, so here I would like to share some of our practices in scaling logstash in distributed real-time way based on Storm, Spark Streaming or YARN. The main purpose is to enhance ourselves based on the community's great feedback, and also want to know if it's helpful enough to open source our solution or contribute to logstash community.

Problem

As all you know, logstash is a great tool to collect, enrich and transport logs with high usability and extensibility, but limited scalability. Especially while big data distributed systems like hadoop, hbase or spark are becoming more and more popular and widely adopted, how to process extremely large scale of logs in real-time is becoming a more and more serious challenge.

The limitation is mainly because logstash pipeline runs on single process and has been placed as an agent in the past. And currently officially recommended scaling solution (https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html) still seems to have some problems like:

  • Too complex OPS architecture and difficult for maintenance
  • Inflexible for dynamic scaling and rebalancing
  • Relatively weak fault-tolerant and message guarantee

Solution

The solution by our side is to deeply customize logstash pipeline execution framework and deploy logstash pipeline directly onto real-time distributed cluster like Storm, Spark streaming or Semza (Yarn).

Thanks for the clear architecture of logstash, the log processing abstraction paradigm is highly extensible and extremely simple:

Threads[input plugin] -> (sized-queue) -> Threads[filter plugin] -> (sized-queue) -> Threads[output plugin]

So that we could seamless to convert the pipeline into distributed real-time streaming application like storm topology as following:

StormSpout[input plugin] -> (zeromq) -> StormBolt[filter plugin] -> (zeromq) -> StormBolt[output plugin]

The solution mainly proves us following benefits:

  • High scalability and flexible deployment and management
  • Simple and universal pipeline definition syntax for both agent-side data collection and cluster real-time processing
  • Natively support logstash existing plugins and extensions
  • Storm-based fault tolerate and message guarantee

Though here we are talking on storm, the pipeline execution framework is platform-independent in fact, so that we could definitely also extend to other distributed streaming data processing system like Spark streaming or Samza.

Any comments or feedbacks will be highly appreciated. :slight_smile:

Best Regards,
Hao


(Mark Walkom) #2

Thanks for the post!

What you've done sure is interesting, is there any chance you will or have open sourced this?


(Colin Surprenant) #3

Hao,

Great stuff! I would love to get more details on your implementation. Any plans of open sourcing it? I am sure this could be of great benefit to the community.

Thanks,
Colin


(Hao Chen) #4

Yes, and I am researching about the community's feedback. If the solution could be helpful enough to make some benefit for the public community, I'm definitely willing to spend resource to make it more generic and open source it.


(Hao Chen) #5

Colin,

Thanks and really glad to hear that you are interested, and I am exactly posting to research about the community's feedback. If the solution could be helpful enough to make some benefit for the public community, I'm definitely willing to spend resource to make it more generic and even open source it.

Thanks,
Hao


(Colin Surprenant) #6

Hao,

I am not sure I completely understand you current setup and would love to have more details! Are you running in Storm and shelling out to independent partial logstash pipelines in spouts and bolts?

Thanks,
Colin


(John Sotiropoulos) #7

Hello @haoch !
It 's great to hear that there is a solution which "makes" logstash more scalable. That means that you didn't use logstash at all, right? You just simulated what logstash does through storm spouts and bolts.
I have to do the exact same thing in order to monitor robust traffic real time, so it would be great if you could help with your solution. Is your solution somewhere available?! Whole or part of your code or a picture of the architecture of your system (like if you use any brokers, if you use es-hadoop for this, etc)?


(Chintu Kumar) #8

Hi @haoch,
I am looking for something similar to what you have done for one of my applications.
I would like to have a discussion with you about the same.
Kindly get back to me when you are free.

Thanks,
Chintu


(system) #9