We have a logging infrastructure, built as follows:
- Filebeat agents collect logs and ship them to a single Kafka topic, at ~50k messages per second. This part of the architecture works great.
- The topic has 50 partitions, so we run 50 Logstash instances, each with a single consumer thread.
- Our Logstash configuration is pretty simple, without many grok filters. I will add the configuration (both logstash.conf and logstash.yml) as a comment on this thread.
- We initially used the default in-memory queue, which caused massive consumer lag on Kafka (millions of messages!). We switched to the disk-based (persistent) queue and now have no lag on Kafka at all, but we fear the lag has simply moved into the Logstash queue (a sketch of the relevant settings follows this list).
- We output everything to ES, splitting all the logs into 4 indices.
- Our ES cluster consists of 2 client nodes (8 cores, 32 GB RAM each) and 3 combined data/master nodes (8 cores, 32 GB RAM each).
- ES indexes documents at ~30k per second, roughly 10k per data node. At that rate the client nodes collapse because heap usage climbs past 85%.
- We also ran some tests and found that Elasticsearch is our bottleneck.
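
For completeness, the queue change mentioned above comes down to a few settings in logstash.yml. This is only a sketch with placeholder values; the real files will follow in the comment:

```yaml
# logstash.yml -- switch from the default in-memory queue to the persistent (disk-based) queue
queue.type: persisted

# Cap on disk usage per pipeline; once the queue is full, Logstash applies
# back-pressure to the Kafka input instead of growing further (placeholder value)
queue.max_bytes: 8gb

# Directory where the queue pages are written (placeholder path; should be a fast local disk)
path.queue: /var/lib/logstash/queue
```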
Now, I have a few questions I would like to ask:
- How can I monitor the disk-based queue in Logstash? I need to be sure we are not accumulating lag inside Logstash, which could fill the disk (and probably trigger various other failures). See the monitoring sketch after these questions.
- Why is the output to ES so slow, and how can I make it faster? We tried changing the number of output workers, but it didn't improve anything. The index-level settings we are considering are sketched below as well.
- For handling approximately 50k+ messages per second, what ES cluster size would you suggest?
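
Regarding monitoring the persistent queue, the only thing we have found so far is the Logstash node stats API (assuming the default API binding on port 9600); something along these lines should show how much data is sitting in each pipeline's queue:

```sh
# Per-pipeline stats; the "queue" section reports the queued event count and
# the queue size on disk (exact field names vary slightly between Logstash versions)
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```

Beyond that, we would simply watch disk usage under path.queue and alert before it approaches queue.max_bytes.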
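On the ES side, the index-level knobs we are considering (only a sketch; the values are placeholders and the "logs-*" index pattern is hypothetical) are relaxing the refresh interval and dropping replicas on the hot write indices during heavy indexing:

```sh
# Dynamic index settings applied via the index settings API
curl -s -XPUT 'http://localhost:9200/logs-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "index": {
      "refresh_interval": "30s",
      "number_of_replicas": 0
    }
  }'
```

On the Logstash side, pipeline.batch.size in logstash.yml controls how many events go into each bulk request towards ES, which presumably matters here as well.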