Greetings, I am facing a problem with Logstash performance degrading over time. Any advice or hints are welcome. Thanks very much in advance.
Logstash suffers from a performance leak over time. The data stream has two major peaks per day, roughly six hours apart. For some reason, Logstash does not free the resources it allocated to handle the peak load once the peaks have been processed. The affected resources are mainly CPU and RAM. As a result, resource consumption builds up from peak to peak. After roughly three days or more, this causes a substantial delay before incoming data arrives in Elasticsearch. As the overall data load increases, the delay sets in even earlier.
Details about environment
- Logstash version 7.9.3
- Two Logstash instances run as Docker containers on two VMs
- OS of each VM is RHEL 7
- Each VM is provisioned with 8 CPU cores and 16 GB RAM
- Kafka topic is split up into 8 partitions
- Peak data load is around 10k data points / s
- Logstash does not carry out any substantial filter logic (only two debug fields are added; turning this off did not show any effect)
- Almost all configurations (besides the ones discussed in the next section) are default values
- This problem never occurred when using Logstash version 6.x
Measures tried so far
- Limit RAM configuration to 8 GB per instance. As this measure showed no effect, RAM has been set back to 12 GB per instance.
- Limit number of workers to number of topic partitions (4 for each VM/instance). Did not show any effect but it should make sense anyway.
- Do not configure Kafka consumer threads explicitly (This was done to avoid unbalanced partition distribution between the two instances). Did not show any effect.
- Change batch size from the default value (125) to 1000. This did show some effect. JVM heap usage was much higher on one Logstash instance but, for some reason, not on the other, although both instances were configured identically (cf. the screenshots). Kafka partitions consumed by the Logstash instance with low JVM heap usage showed a substantial build-up in consumer lag. This resulted in a delay of about half an hour for incoming data in Elasticsearch.
- Split up the single "bigger" Logstash instance per VM into four smaller ones. This seems to mitigate or at least postpone the performance leak.
- Restart the instances from time to time
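To compare the two instances while testing the measures above, it may help to sample JVM heap and old-generation GC figures from the Logstash monitoring API at regular intervals. The following is only a minimal sketch: it assumes the API is reachable on its default port 9600 and that the response of `/_node/stats/jvm` has the usual `jvm.mem` and `jvm.gc.collectors.old` fields. The hostnames and the helper names (`fetch_jvm_stats`, `summarize_jvm`, `heap_growth`) are made up for illustration.

```python
import json
import urllib.request


def fetch_jvm_stats(host, port=9600, timeout=5.0):
    """Fetch JVM stats from the Logstash monitoring API.

    Assumption: the API is enabled and listens on port 9600 (the default).
    """
    url = f"http://{host}:{port}/_node/stats/jvm"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_jvm(stats):
    """Reduce a node stats response to the few figures of interest here."""
    jvm = stats["jvm"]
    mem = jvm["mem"]
    old_gc = jvm["gc"]["collectors"]["old"]
    return {
        "heap_used_percent": mem["heap_used_percent"],
        "heap_used_mb": mem["heap_used_in_bytes"] / (1024 * 1024),
        "old_gc_count": old_gc["collection_count"],
        "old_gc_millis": old_gc["collection_time_in_millis"],
    }


def heap_growth(samples):
    """Change in heap_used_percent between the first and the last sample."""
    return samples[-1]["heap_used_percent"] - samples[0]["heap_used_percent"]


# Example usage (hypothetical hostnames, needs network access):
#   for host in ("logstash-vm1", "logstash-vm2"):
#       print(host, summarize_jvm(fetch_jvm_stats(host)))
```

Sampling shortly before a peak and again well after it, then passing the summaries to `heap_growth`, would show whether heap usage really fails to drop back after the peak, and whether the two instances diverge as the screenshots suggest.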
VM/Instance using JVM Heap
VM/Instance with limited JVM Heap usage