Logstash Performance Leak When Consuming Kafka Topic into Elasticsearch

davidb · March 9, 2021, 11:39am

Greetings, I am facing a problem with Logstash performance degradation over time. Any advice or hints are welcome. Thanks very much in advance.

Short summary

Logstash suffers from a performance leak over time. The data stream has two major peaks per day, roughly six hours apart. For some reason, Logstash does not free the resources it allocated for handling the peak load once the peaks have been processed. The affected resources are mainly CPU and RAM. As a result, resource consumption builds up from peak to peak. After roughly three days or more, this results in a substantial delay for incoming data in Elasticsearch. With increasing overall data load, the delay sets in even earlier.

Details about environment

Logstash version 7.9.3
Two Logstash instances run as Docker containers on two VMs
OS of each VM is RHEL 7
Each VM is provisioned with 8 CPU cores and 16 GB RAM
Kafka topic is split up into 8 partitions
Peak data load is around 10k data points / s
Logstash does not carry out any substantial filter logic (only two debug fields are added; turning this off did not show any effect)
Almost all configurations (besides the ones discussed in the next section) are default values
This problem never occurred when using Logstash version 6.x

Countermeasures

Limit RAM configuration to 8 GB per instance [1]. As this measure showed no effect, RAM has been set back to 12 GB per instance.
Limit the number of workers to the number of topic partitions (4 for each VM/instance). This did not show any effect, but it should make sense anyway.
Do not configure Kafka consumer threads explicitly (this was done to avoid an unbalanced partition distribution between the two instances). This did not show any effect.
Adapt the batch size from the default value (125) to 1000 [2]. This did show some effect. JVM heap usage was much higher on one Logstash instance but, for some reason, not on the other, although both instances were configured identically (cf. the Screenshots section). The Kafka partitions consumed by the Logstash instance with low usage of the available JVM heap showed a substantial build-up in consumer lag. This resulted in a delay of about half an hour for incoming data in Elasticsearch.
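
For context on the batch size change: the Logstash tuning guide [2] describes the number of in-flight events as roughly pipeline.workers multiplied by pipeline.batch.size. Below is a minimal sketch of that arithmetic for the settings mentioned above. The average event size of 2 KB is a made-up value for illustration, not something measured in this setup, and the helper names are hypothetical.

```python
def inflight_events(workers: int, batch_size: int) -> int:
    """Upper bound on events held in memory by the pipeline workers at once."""
    return workers * batch_size


def inflight_bytes(workers: int, batch_size: int, avg_event_bytes: int) -> int:
    """Rough payload size of the in-flight events, ignoring per-object JVM overhead."""
    return inflight_events(workers, batch_size) * avg_event_bytes


WORKERS = 4             # workers per instance, as configured above
AVG_EVENT_BYTES = 2048  # ASSUMPTION: illustrative value, not measured

for batch_size in (125, 1000):  # default vs. the tuned value
    events = inflight_events(WORKERS, batch_size)
    mib = inflight_bytes(WORKERS, batch_size, AVG_EVENT_BYTES) / 1024**2
    print(f"batch.size={batch_size}: {events} in-flight events, ~{mib:.1f} MiB payload")
```

Under that assumed event size, the raw payload stays small compared to a multi-GB heap, so this estimate on its own would not explain the difference in heap usage between the two instances.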

Current workarounds

Split up the single "bigger" Logstash instance per VM into four smaller ones. This seems to mitigate or at least postpone the performance leak.
Restart the instances from time to time
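
To compare the instances, or to decide when a restart is due, the JVM numbers can be read from the Logstash monitoring API (GET /_node/stats/jvm, port 9600 by default). The following is only a sketch: the 85 % threshold and the host name in the comment are hypothetical, the sample values are invented, and whether heap usage is the right restart signal for this problem is an assumption.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the JVM section of the node stats from a running Logstash instance."""
    with urlopen(f"{base_url}/_node/stats/jvm", timeout=timeout) as resp:
        return json.load(resp)


def summarize_heap(stats: dict) -> dict:
    """Reduce the node stats response to the few values of interest here."""
    mem = stats["jvm"]["mem"]
    old_gc = stats["jvm"]["gc"]["collectors"]["old"]
    return {
        "heap_used_percent": mem["heap_used_percent"],
        "heap_used_gb": round(mem["heap_used_in_bytes"] / 1024**3, 2),
        "old_gc_count": old_gc["collection_count"],
        "old_gc_seconds": old_gc["collection_time_in_millis"] / 1000,
    }


def needs_restart(summary: dict, heap_threshold_percent: int = 85) -> bool:
    """ASSUMPTION: treat sustained high heap usage as the restart signal."""
    return summary["heap_used_percent"] >= heap_threshold_percent


# Invented sample shaped like the API response, so the sketch runs offline.
# Against a live instance (hypothetical host name):
#   stats = fetch_jvm_stats("http://logstash-vm1:9600")
sample = {
    "jvm": {
        "mem": {"heap_used_percent": 90, "heap_used_in_bytes": 6442450944},
        "gc": {"collectors": {"old": {"collection_count": 12,
                                      "collection_time_in_millis": 3400}}},
    }
}

summary = summarize_heap(sample)
print(summary, "restart:", needs_restart(summary))
```

Run periodically against both VMs, this would record the same numbers as the screenshots below and show how heap usage and old-generation GC time develop between the two daily peaks.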

Screenshots

VM/Instance using JVM Heap

VM/Instance with limited JVM Heap usage

Links

[1] Performance Troubleshooting | Logstash Reference [7.9] | Elastic
[2] Tuning and Profiling Logstash Performance | Logstash Reference [7.9] | Elastic

