I've tried to look up every piece of documentation, every example, and every blog post available on the net, but I couldn't find anything useful on how to speed up ingesting from Kafka. Logstash barely uses any resources (10% of 16 vCPU, 1 GB RAM), same for Kafka. The nodes are in the same data center, so the network isn't a bottleneck either. The only thing that changed the throughput was the "consumer_threads" setting, which is set so the instances add up to the total number of partitions (15 per LS instance). When I changed anything else (e.g., fetch_max_bytes => 157286400), nothing happened, so I removed those settings and left them at their defaults.
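For reference, a stripped-down version of the input block I'm running (broker and topic names are placeholders, and the commented-out options are just illustrations of the kind of settings I experimented with before reverting them):

```
input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"   # placeholder broker list
    topics            => ["my-topic"]                  # placeholder topic name
    group_id          => "logstash"
    consumer_threads  => 15                            # matches the partitions handled by this instance
    # options I tried and then reverted to defaults (values are illustrative):
    # fetch_max_bytes   => 157286400
    # max_poll_records  => 2000
    # fetch_min_bytes   => 1048576
    # fetch_max_wait_ms => 500
  }
}
```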
I did read similar issues from the past, but those were mostly unanswered. Unfortunately, the partitions show significant lag (offset -X million) and I can't work out how to solve this, since clearly throwing resources at the problem isn't the solution. I also couldn't figure out whether setting something in the Kafka input configuration would change the limits on the Kafka server. Naturally, I raised the limits on the Kafka servers as well; I was just wondering.
What am I missing? How can I squeeze a lot more throughput out of Logstash?
If your output can't keep up with the ingest rate, it could be the bottleneck.
I have a similar, but smaller, scenario where I have a Kafka input and an elasticsearch output, and the lag that I sometimes see is caused by my elasticsearch nodes not being able to keep up with the indexing rate.
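For what it's worth, my output side is just a plain elasticsearch output along these lines (hosts and index name are placeholders); the size of the bulk requests it sends is governed by pipeline.batch.size and pipeline.workers rather than by the plugin itself, so the indexing side can become the limit even when the config looks trivial:

```
output {
  elasticsearch {
    hosts => ["https://es-node-1:9200", "https://es-node-2:9200"]  # placeholder hosts; listing several data nodes spreads the bulk requests
    index => "my-index-%{+YYYY.MM.dd}"                             # placeholder index name
    # each pipeline worker sends one bulk request per batch, so
    # pipeline.batch.size and pipeline.workers (logstash.yml) set the bulk load on ES
  }
}
```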
Yes, it's Elasticsearch, though there are a lot of free resources there as well. I know for a fact that after a hiccup it can index at least 1.5 times the current rate (events/s). The ES nodes are really beefy, and if they were the bottleneck, then I wouldn't see the daily "wave" in the indexing rate. At least I'd think so. I'm fortunate regarding resources and, as noted before, the VMs are in the same data center.
Well, do you have any monitoring of your cluster? What is the storage type of your data nodes? HDD or SSD? How many IOPS can it reach?
How many shards does your index have? How many replicas? What is the refresh_interval setting for your index?
You will need to troubleshoot your pipeline to find where the bottleneck is.
I would focus on the elasticsearch side, which in my experience can be the bottleneck. See how the elasticsearch nodes are behaving when you start to get lag in your partitions: look at the CPU, memory, and IOPS of the system.
There are some tips that you can follow to tune your cluster for indexing.
Thanks for getting back! As I mentioned above, I don't have any issue with hardware resources. The Elasticsearch nodes have SSD disks and plenty of memory and CPU to tap into. There aren't any replica shards, and the sharding/mapping settings are tuned to perform as well as possible.
If the Elasticsearch cluster were the bottleneck, the indexing rate wouldn't drop off after peak time but would stay at the same level until there wasn't any lag left.
However, that's not the situation. I simply can't fetch more than X events from a given Kafka topic unless I partition it further and throw more workers at it, and I don't think that should be the only option. I don't have much experience with Kafka, so it's possible there's something there (one of the default options) that I could change, but as far as I got in the docs and various online articles, I've already raised every throughput-related option.
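To illustrate what I mean (names are placeholders): each Logstash instance runs an input like the one below, and the only lever that has actually moved the throughput is adding partitions to the topic and matching them with more consumer threads/instances, since threads beyond the partition count just sit idle in the consumer group:

```
input {
  kafka {
    bootstrap_servers => "kafka-1:9092"   # placeholder
    topics            => ["my-topic"]     # placeholder
    group_id          => "logstash"       # same group on every instance, so partitions are split between them
    consumer_threads  => 15               # e.g. 2 instances x 15 threads = 30 partitions; extra threads would sit idle
  }
}
```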