I've been working to convert an ELK setup over to using Logstash as a parsing engine, from something home-grown (don't ask). I'm running into a problem with the performance of the Kafka input.
Versions: Everything 7.6.x
Logs go from rsyslog to a 6-node Kafka setup. I've played with different partition counts, but right now the topic I care about has 24 partitions with a replication factor of 2.
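For reference, this is roughly how I check the topic layout (the topic name and broker address are placeholders, not my real values; older Kafka releases take --zookeeper instead of --bootstrap-server for this tool):

```sh
# Confirm 24 partitions and replication factor 2 for the topic
kafka-topics.sh --bootstrap-server kafka01:9092 --describe --topic syslog
```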
Each of the 6 nodes also runs a copy of Logstash. The nodes have 15 cores, and Logstash is configured with 30 pipeline workers; that was originally at the default, but I upped it in an attempt to increase performance.
The Kafka input has consumer_threads set to 4.
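For context, here is roughly what the relevant part of the pipeline config looks like (the broker list, topic name, and group id are placeholders rather than my real values):

```
input {
  kafka {
    # Placeholder broker list / topic / group id -- not the real values
    bootstrap_servers => "kafka01:9092,kafka02:9092,kafka03:9092"
    topics            => ["syslog"]
    group_id          => "logstash"
    # 4 consumer threads per Logstash instance, so 6 nodes x 4 = 24
    # consumers, one per partition
    consumer_threads  => 4
  }
}
```

pipeline.workers is set to 30 in logstash.yml on each node, as mentioned above.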
My basic problem is that I cannot pull from Kafka fast enough. If I use kafka-consumer-groups.sh to watch partition lag, it just climbs and climbs over time. I'm pushing between 30k and 40k messages into Kafka in prod.
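This is how I'm watching the lag (the group name and broker address are placeholders):

```sh
# Per-partition current offset, log-end offset, and lag for the Logstash consumer group
kafka-consumer-groups.sh --bootstrap-server kafka01:9092 --describe --group logstash
```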
Early on in this project my CPUs were pegged, and I traced that to some bad grok patterns. Now the CPUs run at maybe 40% on average, with most of my Logstash threads idle.
The problem is not:
- my filters: I've checked them extensively and CPU usage is fine.
- the Elasticsearch backend I'm sending to: with the Logstash setup in play I can watch my ES cluster ingest ~20k events per second, but without Logstash I have seen this same system ingest well over 100k per second in tests.
I've read a lot of conflicting information on having Logstash pull from Kafka. I need more throughput and am not sure how to get it: some sources say add more partitions, others say that's unlikely to help. I have idle CPU and want to put it to work.

What is the current recommendation for maximizing Kafka pull performance?