Filebeat agents collect logs and ship them to a single Kafka topic; we are talking about ~50k messages per second. This part of the architecture works great.
The topic has 50 partitions, so we run 50 Logstash instances, each with a single consumer thread.
Our Logstash configuration is pretty simple, without many grok filters. The full configuration (both logstash.conf and logstash.yml) will be added as a comment on this thread.
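To give a rough idea of the input side before the full files are posted, it looks roughly like this (broker addresses, topic name, group id, and codec here are placeholders, not our real values):

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"  # placeholder brokers
    topics            => ["logs"]                   # placeholder topic name
    group_id          => "logstash"                 # all 50 instances share one consumer group
    consumer_threads  => 1                          # one thread per instance -> 50 consumers, one per partition
    codec             => "json"                     # assuming Filebeat ships JSON
  }
}
```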
Also, we initially used the default in-memory queue, which caused massive consumer lag on Kafka (millions of messages!). We switched to the disk-based (persisted) queue and now have no lag on Kafka at all, but we fear the lag has simply moved into the Logstash queue.
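The queue change amounts to something like this in logstash.yml (the path and size shown here are placeholders, not our exact values):

```
# logstash.yml - persistent queue settings
queue.type: persisted
path.queue: /var/lib/logstash/queue   # placeholder path, on its own disk
queue.max_bytes: 8gb                  # placeholder cap so the queue cannot fill the whole disk
```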
We output everything to ES, splitting all the logs into 4 indices.
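The output section is along these lines (hosts, the field used to pick the index, and the index pattern are illustrative placeholders; the real split into 4 indices is in the config we will attach):

```
output {
  elasticsearch {
    hosts => ["http://es-node1:9200", "http://es-node2:9200"]   # placeholder hosts
    # index chosen from a field set by Filebeat; field and index names are placeholders
    index => "%{[fields][log_type]}-%{+YYYY.MM.dd}"
  }
}
```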
ES indexes docs at a rate of ~30k per second, about 10k per data node. At that point the client nodes collapse because heap usage reaches 85%+.
We also ran some tests and found that Elasticsearch is our bottleneck.
Now, I have a few questions I would like to ask:
How can I monitor the disk-based queue in Logstash? I need to be sure we are not accumulating lag inside Logstash that could fill up the disk (and probably cause various other failures).
Why is outputting to ES so slow, and how can I make it faster? We tried changing the number of output workers, but it didn't improve anything.
For handling approximately 50k+ messages per second, what ES cluster size would you suggest?