It depends on exactly what you mean by "distributed". Logstash doesn't have clustering support in the same manner as e.g. Elasticsearch, but that doesn't mean you can't run multiple Logstash instances to process a single stream of input events.
By "distributed" I mean a distributed real-time computing system (for instance Apache Storm). So, if I understand correctly, Logstash is not a distributed system. Does that mean I might get worse performance with Logstash than with a distributed system such as Apache Storm or Spark Streaming?
Furthermore, how do you set up multiple Logstash instances?
By "distributed" I mean a distributed real-time computing system (for instance Apache Storm). So, if I understand correctly, Logstash is not a distributed system.
You're explaining the word "distributed" only by giving Apache Storm as an example, so you're not really explaining it at all. I think this discussion would be more fruitful if you expressed what you're looking for and what you want to accomplish.
Does that mean I might get worse performance with Logstash than with a distributed system such as Apache Storm or Spark Streaming?
Maybe, maybe not.
Furthermore, how do you set up multiple Logstash instances?
I'm not sure what you're asking. Do you want to know how to start multiple independent instances of Logstash or how to get a number of such instances to process data from a single source? Again, give us more background.
Sorry, Magnus, for not being clear!
In my case, "distributed" means dividing and parallelizing processing across several computers.
I would like to set up a real-time analytics system and deploy it in a cluster. It needs to collect, process, and visualize data held in a Kafka server. To collect and process my data, I first used Logstash and then stored the processed data in Elasticsearch.
I read online that Apache Storm and Spark Streaming are very popular distributed real-time computing systems. Since Logstash is not distributed, my big concern is that it might not process input data fast enough, might not scale, and might not be suitable for a distributed environment.
Regarding the last question, I want to know how to get a number of such instances to process data from a single source.
Since Logstash is not distributed, my big concern is that it might not process input data fast enough, might not scale, and might not be suitable for a distributed environment.
Again, you need to understand that "Logstash is not distributed" isn't a useful comment. It's ambiguous and might not even be relevant to the discussion.
Regarding the last question, I want to know how to get a number of such instances to process data from a single source.
That depends on the source, but assuming you're running some kind of broker (like Kafka) you can run multiple Logstash instances and point them at different queues/topics/partitions (terminology depending on the broker) and achieve parallel processing of data.
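To make the idea concrete, here is a minimal Python sketch of how a topic's partitions could be divided among several independent Logstash instances. It is only an illustration of the principle: the function `assign_partitions` and the instance names are hypothetical, and a simple round-robin split is assumed. It is not the actual assignment logic used by Kafka or by Logstash.

```python
def assign_partitions(partitions, instances):
    """Spread partition ids across instance names, round-robin.

    Hypothetical helper for illustration only; a real broker decides
    the assignment itself.
    """
    if not instances:
        raise ValueError("need at least one instance")
    assignment = {name: [] for name in instances}
    for i, partition in enumerate(sorted(partitions)):
        owner = instances[i % len(instances)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three instances (names are made up).
plan = assign_partitions(range(6), ["logstash-a", "logstash-b", "logstash-c"])
for name, owned in plan.items():
    print(name, owned)
```

Each instance only reads the partitions it owns, so the instances work in parallel without coordinating with each other. Under this scheme, adding an instance reduces the share each one handles, up to the point where there are as many instances as partitions.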
Sorry to bring this up again, but distribution worries me (maybe there is something I don't understand). Being distributed is good for scalability (when the amount of input data increases)...
What I am trying to understand is why some big companies that use Elasticsearch, like Twitter and Predikto, use Storm or Spark Streaming instead of Logstash.