I am building a distributed, real-time cluster system to supervise and analyze a network. After some research online, I came up with the following technologies:
-for real-time processing: Logstash, Storm and Spark Streaming
-for storage: Elasticsearch
-for analysis: Apache Spark over Hadoop (I will use ES-Hadoop to connect with Elasticsearch)
-for data visualization: Kibana, D3.js, C3.js
However, Logstash is not mentioned as often as Spark Streaming and Storm. I found the following architecture online, shown in the picture below:
1) I don't understand why Logstash is not often mentioned as a real-time processing system like Spark Streaming and Storm. What are the main reasons? I have been using it and it is very powerful.
2) Regarding the analysis part, can I use the machine learning libraries in that configuration?
Maybe my question is not clear, but what I am asking is: what might be the main reasons for not choosing Logstash over Spark Streaming and Storm? It is very difficult for me to answer this question, since I cannot find any comparison online.
I think the reason is the following: Logstash just forwards the data to ES. You can of course filter and rewrite the data with grok beforehand, but ES still does the indexing.
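To make the "filter, rewrite, forward" role concrete, here is a rough Python sketch of what such a pipeline stage does to each event. This is not Logstash itself: the regular expression only stands in for a grok pattern, and the log line layout, the field names and the `network-logs` index name are all assumptions made up for illustration.

```python
import json
import re

# Regex with named groups, standing in for a grok pattern such as
# %{IP:src_ip} %{WORD:method} %{URIPATH:path} %{NUMBER:status}
# (the log layout and field names are assumed for this example).
LINE_PATTERN = re.compile(
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3})"
)


def parse_line(line):
    """Turn one raw log line into a dict of named fields, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])  # "rewrite" step: cast the status to a number
    return event


def to_bulk_action(event, index="network-logs"):
    """Serialize an event as the action line plus source line used by the ES bulk API."""
    return json.dumps({"index": {"_index": index}}) + "\n" + json.dumps(event) + "\n"


if __name__ == "__main__":
    parsed = parse_line("10.0.0.5 GET /index.html 200")
    print(to_bulk_action(parsed), end="")
```

Each event is handled on its own and then handed to ES, which builds the index. Nothing here aggregates across events or trains a model, which is the kind of work the answer assigns to Spark.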
If you use Spark instead, Spark can do some of the indexing work and offload it from ES. You can also do a bunch of other useful things with Spark, such as using its machine learning library, Spark SQL, and so on.
If you have a moderate infrastructure and data load, you can stick with the Logstash->ES->Kibana stack. However, if your data is huge and you want to do more advanced processing, you might choose the following instead:
Logstash->HDFS->Spark<->ES->Kibana (notice that the relationship between Spark and ES is bidirectional: you can pre-index in Spark to offload ES, and you can also pull data from ES for further analysis. This URL has some explanation: http://stackoverflow.com/questions/31726409/what-is-elasticsearch-hadoop-es-hadoop-and-its-benefit-over-hbase-for-a-live-w)
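As a rough illustration of the Spark<->ES part, and of the machine learning question, the sketch below reads an index into Spark through ES-Hadoop, clusters the rows with MLlib k-means, and writes the result back. It is only a sketch under assumptions: the ES-Hadoop connector jar must be on Spark's classpath, ES is assumed to be reachable on localhost:9200, and the index names and the `bytes` and `duration` fields are hypothetical.

```python
# Assumed connection settings for a local Elasticsearch node (not from the original post).
ES_OPTIONS = {"es.nodes": "localhost", "es.port": "9200"}
ES_FORMAT = "org.elasticsearch.spark.sql"  # Spark SQL data source provided by ES-Hadoop


def read_index(spark, index):
    """Load an Elasticsearch index as a Spark DataFrame (ES -> Spark)."""
    return spark.read.format(ES_FORMAT).options(**ES_OPTIONS).load(index)


def write_index(df, index):
    """Append a DataFrame's rows to an Elasticsearch index (Spark -> ES)."""
    df.write.format(ES_FORMAT).options(**ES_OPTIONS).mode("append").save(index)


def cluster_flows(df, k=3):
    """Label each row with a k-means cluster computed from two hypothetical numeric fields."""
    # Imported lazily so the helpers above can be loaded without PySpark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=["bytes", "duration"], outputCol="features")
    features = assembler.transform(df)
    model = KMeans(k=k, seed=1, featuresCol="features", predictionCol="cluster").fit(features)
    # Drop the vector column so only plain fields are written back to ES.
    return model.transform(features).drop("features")


def main():
    """Run the round trip; needs a Spark installation and a reachable ES node."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("network-analysis-sketch").getOrCreate()
    flows = read_index(spark, "network-flows")                     # hypothetical index
    write_index(cluster_flows(flows), "network-flows-clustered")   # hypothetical index
    spark.stop()
```

Calling `main()` would perform the full ES -> Spark -> ES round trip. Kibana can then visualize the written index like any other.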
I'm new to these concepts too and am still trying to understand the possible architecture variations and tools, but this is what I've understood so far. I hope it helps.