We've deployed Filebeat and the ELK stack in our Kubernetes cluster (Azure Kubernetes Service). The pipeline is: Filebeat -> Kafka -> Logstash -> Elasticsearch <- Kibana.
However, we noticed very significant lag: the consumer simply doesn't keep up with the rate at which the producers publish.
To isolate the issue, we installed Kafka exporter and built a Grafana dashboard to monitor the consumer group offset, offset rate, and lag.
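For context, the dashboard panels are driven by queries along these lines (standard kafka_exporter metric names; the `logstash` consumer group and the topic grouping here are illustrative, not our exact queries):

```promql
# Total lag per topic for the Logstash consumer group
sum(kafka_consumergroup_lag{consumergroup="logstash"}) by (topic)

# Approximate consume rate (messages/s) per topic over 5m
sum(rate(kafka_consumergroup_current_offset{consumergroup="logstash"}[5m])) by (topic)
```

The consumer rates quoted below come from the second kind of query.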
To nail the problem down further, we deployed ES and Logstash outside the cluster on their own VMs and tried the different combinations. Here are our observations:
- When Logstash runs in Kubernetes against the in-cluster ES, the consumer rate is very low (~100/s)
- When Logstash runs in Kubernetes against the standalone ES, the consumer rate is also very low (~100/s)
- When Logstash runs on its own VM against the standalone ES, the consumer rate is about 7k/s
- When Logstash runs on its own VM against the in-cluster ES, the consumer rate is about 6.5k/s
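The Logstash pipeline config is essentially identical in all four cases; a minimal sketch of its shape (broker, topic, and ES host names are placeholders, and `consumer_threads` is at its default):

```conf
input {
  kafka {
    bootstrap_servers => "kafka:9092"      # placeholder broker address
    topics            => ["filebeat"]      # placeholder topic
    group_id          => "logstash"        # the consumer group we monitor
    consumer_threads  => 1                 # plugin default
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]  # placeholder; in-cluster or standalone ES
  }
}
```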
This makes us suspect that the Logstash pod in Kubernetes isn't getting enough resources. The VM we deployed Logstash on has 4 vCPUs and 16 GB RAM, but Logstash uses nowhere near that much.
So we increased the resource requests/limits for the Logstash pod to 2 CPUs and 4 GB RAM. The consumer rate is still abysmal (no visible difference, still about 100/s), and according to `kubectl top pod`, Logstash isn't saturating the resources we're giving it either.
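Concretely, the pod spec change looks like this (setting requests equal to limits is our choice here, not a requirement):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```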
I'm kinda out of ideas here. Any pointers?
We tried both Logstash 6.2.2 and 7.1.1. The JVM options are:
-Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.compile.invokedynamic=true -Djruby.jit.threshold=0 -XX:+HeapDumpOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom