I am planning to deploy Filebeat (and some other Beats, such as Winlogbeat and maybe Packetbeat) with Logstash and Elasticsearch, and I was thinking it would make sense to add a queuing system so that I also have redundancy.
I read in a previous thread that Logstash 5.0 will have some sort of in-memory queue so that it can hold more than 20 (?) messages simultaneously.
I did some research and I have to say that I am a bit confused.
Should I use a queuing system, or should I just send everything to multiple Logstash instances?
I went over the specs of Redis and RabbitMQ, and it looks like Redis is the winner; however, I am not sure it is really necessary.
Well, I am less familiar with Logstash 5.0, but I have been using Logstash for a long time.
Whether to add queuing really depends on your particular needs. To go into a little more detail, here is some info to chew on.
I do between 8k and 20k messages a second, so I have Logstash write to Kafka first and then have an "indexing" Logstash read from the queue and write to Elasticsearch.
Here is the flow:
file -> Logstash -> Kafka -> Logstash indexer -> Elasticsearch
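For anyone wanting to see what that looks like in practice, here is a minimal sketch of the two configs. The broker address, the topic name "logs", and the consumer group "indexer" are placeholders I made up; the option names follow the Logstash 5.x Kafka plugins.

```
# shipper.conf - hypothetical; tails files and writes raw events to Kafka
input {
  file {
    path => "/var/log/app/*.log"
  }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topic_id => "logs"
  }
}
```

```
# indexer.conf - hypothetical; consumes from Kafka and writes to Elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics => ["logs"]
    group_id => "indexer"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
  }
}
```

The nice property of this split is that the indexer side can be stopped and restarted independently; the shipper keeps writing to Kafka the whole time.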
Now, I use Kafka, which is benchmarked at around 100k messages/s, whereas RabbitMQ is, I think, at around 20k/s, so I opted for Kafka.
I used to use Redis and that worked just as well, but I had to do some tuning to get it working the way I wanted, whereas Kafka worked with the defaults (minus the learning curve).
So, with all that technical stuff out of the way, here is how I would choose:

1. Very low volume and non-critical messages: no queue is needed, which keeps the architecture simple.
2. High volume, or no loss of messages allowed: Redis or another queuing technology would be nice.
2a. Kafka, and I believe RabbitMQ, can be configured to "replay" old messages if you have lost data or want to rebuild your index.
Finally: I do recommend a queue system, because it allows you to stop your Elasticsearch cluster or indexer servers at any time without ever having to touch your Logstash receivers/Beats. The messages will simply queue up in Kafka/RabbitMQ/Redis until you start your indexing Logstash instances again.
Think of it this way: I have 400 servers. If I had to stop Filebeat on all of them and start them again later, that's a pain, and the chances of something getting missed or lost would be high.
But with the queuing system, if I want to do work on Elasticsearch or the indexing nodes, I can at any time; the messages will just be queued up.
@xon You might find this article useful as you consider adding a queuing system: Just Enough Kafka for the Elastic Stack. Since that article was written, Beats 5.x added support for writing directly to Kafka, which can further simplify the architecture to Beats -> Kafka -> Logstash -> Elasticsearch.
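For illustration, a minimal filebeat.yml sketch of that direct Beats -> Kafka hop, using the Beats 5.x output.kafka settings; the host, paths, and topic name are placeholders:

```yaml
# filebeat.yml - hypothetical; ships logs straight to Kafka (Beats 5.x)
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/app/*.log

output.kafka:
  hosts: ["kafka1:9092"]
  topic: "logs"
```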
The problem I am having is that it is a bit hard to define the requirements. From what I know so far, 100k messages per second is more than I will need; however, what Ed said about keeping the queuing system up while updating/upgrading Logstash and Elasticsearch looks like a significant advantage. Still, I am trying to figure out how to configure SSL/TLS so that the logs aren't sent in cleartext (with Redis!).
Well, not really knowing your requirements makes it a little difficult to give any tips.
Some things to consider to find requirements and figure out potential edge cases:
- Edge node lifetime: Is it a standalone server where you can keep logs a little longer if something goes wrong? Or is it a VM/container that might be destroyed at any time, with logs potentially going missing? The latter case will require a queuing system to get logs off the system as fast as possible. Even consider Kafka, or a Redis cluster with load balancing, in case one or two queue instances are down.
- Edge node space requirements/availability (be generous with requirements, as logs can get out of hand during incidents, e.g. hundreds of stack traces):
  - how much space can be reserved on edge nodes for logs to be written
  - which time range of logs can be kept available on edge nodes
  - the log rotation schedule (Filebeat keeps files open until close_older to ensure it has read everything); this can increase space requirements without you knowing (a close_older sketch follows this list)
  - how acceptable it is to lose logs (e.g. a file that has not yet been shipped gets deleted by log rotation)
- Is data loss in the queuing system tolerable? E.g. Logstash has a pipeline of 20 events; if it goes down, events might be lost (persistent queuing in Logstash will solve this). Or Redis, keeping everything in main memory, might lose all data on a crash (potential OOM in Redis).
- How does the queuing system cope with Filebeat sending loads of data while the Logstash instance collecting it is down for some time? E.g. Kafka's default settings might run out of disk space; the retention policies can be changed to take size into account -> potential data loss (a retention sketch follows this list).
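On the log-rotation item above, a hypothetical prospector sketch; close_older is the Filebeat 1.x option name (in Filebeat 5.x it was renamed close_inactive), and the paths and duration are placeholders:

```yaml
# filebeat.yml - hypothetical; close files that haven't changed in 30m
# so rotated files are released sooner and disk space is freed
# (close_older is the 1.x name; Filebeat 5.x calls it close_inactive)
filebeat:
  prospectors:
    - paths:
        - /var/log/app/*.log
      close_older: 30m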
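```

And on the Kafka disk-space item, a sketch of broker settings that bound retention by size as well as time; the values are placeholders, and note that once either limit is hit the oldest segments are deleted, which is where the potential data loss comes in:

```properties
# server.properties - hypothetical Kafka broker retention settings;
# delete the oldest segments once either limit is exceeded
log.cleanup.policy=delete
log.retention.hours=72
# roughly 10 GiB per partition
log.retention.bytes=10737418240
```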
For securing Redis with SSL/TLS, consider using stunnel in front of Redis. The Beats Redis output supports TLS and can connect directly to stunnel (it's actually done this way in the Beats integration tests).
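A minimal sketch of that setup, assuming stunnel terminates TLS on the Redis host and Filebeat connects to it over TLS; hostnames, ports, and certificate paths are placeholders:

```ini
; stunnel.conf on the Redis host - hypothetical; terminates TLS and
; forwards plaintext to the local Redis instance
[redis-tls]
accept  = 0.0.0.0:6380
connect = 127.0.0.1:6379
cert    = /etc/stunnel/redis.pem
```

```yaml
# filebeat.yml - hypothetical Beats 5.x Redis output over TLS,
# pointing at the stunnel port rather than Redis directly
output.redis:
  hosts: ["redis.example.com:6380"]
  key: "filebeat"
  ssl.certificate_authorities: ["/etc/pki/ca.pem"]
```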
> Finally: I do recommend a queue system, because it allows you to stop your Elasticsearch cluster or indexer servers at any time without ever having to touch your Logstash receivers/Beats. The messages will simply queue up in Kafka/RabbitMQ/Redis until you start your indexing Logstash instances again.
The original files will also serve as a buffer, so in many cases it's not necessary to have all events pass through a broker. If Filebeat can't connect, it'll back off and try again later. It's always nice to get the logs off the edge nodes, but the benefit needs to be weighed against the complexity of operating a broker.
> Think of it this way: I have 400 servers. If I had to stop Filebeat on all of them and start them again later, that's a pain, and the chances of something getting missed or lost would be high.
If you're managing Filebeat on 400 machines by hand, you're working way too hard. Also, it's not clear why you'd want to stop Filebeat in the first place; it'll back off on its own.
You are correct that files do buffer on the local host, but when you fire your broker/ELK back up, you will receive a flood of data all at once. Yes, there would be some backoff, but still: if you have 400 hosts, a small trickle from each host is a huge flood at the receiver.
I am not saying I am managing 400 hosts by hand, but even automated tools will fail to restart instances. Personally, I have Ansible set up for managing hosts, plus a heartbeat mechanism to restart hosts that have not communicated in a while (I use Logstash rather than Filebeat, but the same concept can be applied); a sketch of such a restart playbook follows. It works really well.
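As a rough illustration (everything here is hypothetical: the silent_shippers inventory group would be populated by whatever heartbeat check you run, and the service name depends on your shipper):

```yaml
# restart_silent.yml - hypothetical Ansible playbook; restarts the shipper
# on hosts that the heartbeat check flagged as silent
- hosts: silent_shippers
  become: true
  tasks:
    - name: Restart the log shipper service
      service:
        name: logstash   # or filebeat, depending on the shipper
        state: restarted
```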
Either way, there are lots of micro-decisions you need to make while deploying your cluster that influence its design and behavior. I am just pointing out what works well for me and the things I have dealt with over years of running a high-volume site.