Logstash reading from Redis Slows Significantly After Several Days

justin_spies · October 30, 2019, 4:12pm

For our Kubernetes cluster we have ELK set up as follows:

Filebeat -> Redis (AWS ElastiCache) -> Logstash -> Elasticsearch

Logstash is on one cluster and Elasticsearch is on a separate cluster. Both Logstash and ES run on worker nodes dedicated to logging services. Logstash runs as 3 K8s pods, each with 2 CPUs and 3GB of RAM. The JVM heap is left at the default of 1GB.

After running for two to three days, Logstash seems to stop reading from Redis, and the "queue" of items in Redis keeps growing until the Redis cache fills up (the maximum is about 3.2M entries; the node is a cache.m5.large).

I've read some of the performance tuning documentation and have attached VisualVM to the Logstash nodes. When Logstash is working properly, CPU utilization is typically under 15%, with occasional spikes to about 30% (roughly once per minute, lasting about 5 seconds). Full GC rarely happens (I haven't seen one in 20 minutes of watching) and minor GC happens about once per minute.

When it slows down / stops, CPU utilization sits at only about 5% pretty much all the time.

I have tried adjusting a few settings:

Logstash pipeline workers: from '-w 4' to '-w 8', then to '-w 16', all using '-b 2000'
Redis input batch count: the default, then 4000, then 10000
Redis input threads: the default, then 4, then 8

Filebeat generates about 5,000 events per minute into Redis, and we use Logstash to drop about 4,000 of those. They are TLS events resulting from a TCP health check against a TLS endpoint; we want to keep some but drop most, so we use the Logstash drop filter set to 95% rather than dropping them all in Filebeat. I suppose we could change the pipeline to Filebeat -> Logstash (filtering/drop) -> Redis -> Logstash (mutate) -> ES. Please let me know if that would really be considered better (we are using Redis as a buffer / queue, so putting LS in front of it seemed to negate the point of Redis).
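For a rough sense of how long the Redis buffer lasts once Logstash stops reading, the figures above (about 5,000 events per minute in, about 3.2M entries at most) can be plugged into a small Python sketch. This is only back-of-the-envelope arithmetic: it assumes the 5,000/minute figure is the total inflow, that each event is one list entry, and that the function and parameter names are purely illustrative.

```python
def minutes_until_full(capacity_entries, inflow_per_min,
                       drain_per_min=0.0, current_entries=0):
    """Estimate minutes until the Redis list reaches capacity.

    Returns None when the consumers drain at least as fast as the
    producers fill, i.e. the backlog never grows.
    """
    net_growth = inflow_per_min - drain_per_min
    if net_growth <= 0:
        return None
    return (capacity_entries - current_entries) / net_growth


# Logstash fully stalled: nothing is drained from Redis.
stalled = minutes_until_full(3_200_000, 5_000)
print(f"{stalled:.0f} minutes (~{stalled / 60:.1f} hours)")  # 640 minutes (~10.7 hours)
```

Under those assumptions a complete stall fills the cache in a bit under 11 hours, so a slowdown would need to be caught well within that window.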

I don't think we have a very complicated Logstash setup in terms of mutations / pipelines / etc.

I have Logstash set to INFO log level and there are no messages output after the initial startup. I am at a loss as to where to turn next for troubleshooting. Any help is appreciated. I can provide more details as needed.

EDIT:
Logstash 6.8 / ES 6.8 / Filebeat 6.8
Checking one of the three recently restarted (and working) LS nodes (via http://localhost:9600/_node/stats/pipelines?pretty=true), I see it has processed about 1,100,000 events in the past 55 minutes, so about 20K events per minute.
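To catch the slowdown as it happens rather than after Redis fills, the same stats endpoint could be sampled twice and the difference turned into a rate. The Python sketch below is one way to do that, not a tested monitoring tool. It assumes the response has a top-level "pipelines" object whose entries each carry an "events" object with an "out" counter; the helper names (total_events_out, events_per_minute, sample_rate) are made up for illustration.

```python
import json
import time
import urllib.request

# Endpoint quoted above; assumed reachable from inside the Logstash pod.
STATS_URL = "http://localhost:9600/_node/stats/pipelines"


def total_events_out(stats):
    """Sum the 'out' event counter over every pipeline in a stats document."""
    pipelines = stats.get("pipelines", {})
    return sum(p.get("events", {}).get("out", 0) for p in pipelines.values())


def events_per_minute(earlier, later, elapsed_seconds):
    """Convert two counter snapshots into an events-per-minute rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = total_events_out(later) - total_events_out(earlier)
    return delta * 60.0 / elapsed_seconds


def fetch_stats(url=STATS_URL):
    """Fetch and decode the node stats JSON."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def sample_rate(interval_seconds=60, url=STATS_URL):
    """Take two snapshots interval_seconds apart and return the rate."""
    first = fetch_stats(url)
    started = time.monotonic()
    time.sleep(interval_seconds)
    second = fetch_stats(url)
    return events_per_minute(first, second, time.monotonic() - started)
```

Calling sample_rate() on a healthy node should give something close to the 20K events per minute noted above; a value near zero while the Redis list is still growing would point at the input side rather than the filters or the Elasticsearch output.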

system · November 27, 2019, 4:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
