For our Kubernetes cluster we have an ELK stack set up as follows:
Filebeat -> Redis (AWS ElastiCache) -> Logstash -> Elasticsearch
Logstash runs on one cluster and Elasticsearch on a separate cluster. Both Logstash and ES are on worker nodes dedicated to logging services only. Logstash is deployed as 3 K8s pods, each with 2 CPUs and 3 GB of RAM. The JVM heap is set to the default of 1 GB.
After running for two to three days, Logstash seems to stop reading from Redis, and the "queue" of items in Redis keeps growing until the Redis cache fills up (the max is about 3.2M entries; the node is a cache.m5.large).
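One way to tell a full stall from a temporary backlog is to sample the Redis list length at regular intervals (for example with the Redis LLEN command on the key Filebeat writes to) and check whether it ever dips. The following is only a sketch, assuming the samples have already been collected; the function name and the minimum sample count are hypothetical:

```python
def looks_stalled(samples, min_samples=5):
    """Return True if the queue length grew and never decreased.

    samples: queue lengths (e.g. LLEN results) taken at regular
    intervals, oldest first. A consumer that is still reading should
    cause at least an occasional dip, so a series that only grows
    suggests nothing is being consumed.
    """
    if len(samples) < min_samples:
        return False  # not enough data to judge either way
    never_decreased = all(b >= a for a, b in zip(samples, samples[1:]))
    grew = samples[-1] > samples[0]
    return never_decreased and grew
```

This only flags the symptom; it says nothing about why the consumers stopped.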
I've read some of the performance tuning documentation and have attached VisualVM to the Logstash nodes. When it is working properly, CPU utilization is typically under 15%, with occasional spikes to about 30% (once per minute, for about 5 seconds). Full GC rarely happens (I haven't seen one in 20 minutes of watching) and minor GC happens about once per minute.
When it slows down / stops, CPU utilization sits at only about 5% pretty much all the time.
I have tried adjusting a few settings:
- Logstash pipeline workers going from '-w 4' to '-w 8', then to '-w 16', all with a batch size of '-b 2000'
- Redis batch count using the default, then 4000, then 10000
- Redis threads using the default, then 4, then 8
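For context on the '-w' / '-b' combinations above: the Logstash tuning guidance describes the number of in-flight events as the product of pipeline workers and batch size, and all of them have to fit in the heap. Below is a rough sketch of that arithmetic for the values tried, assuming the 1 GB default heap mentioned earlier. The function names are illustrative only, and the per-event figure is an upper bound that ignores everything else living on the heap:

```python
HEAP_BYTES = 1 * 1024**3  # assumption: the 1 GB default heap


def inflight_events(workers, batch_size):
    """Events held in memory at once: workers x batch size."""
    return workers * batch_size


def heap_budget_per_event(workers, batch_size, heap_bytes=HEAP_BYTES):
    """Upper bound on heap bytes available per in-flight event."""
    return heap_bytes // inflight_events(workers, batch_size)


for workers in (4, 8, 16):
    n = inflight_events(workers, 2000)
    budget = heap_budget_per_event(workers, 2000)
    print(f"-w {workers} -b 2000: {n} events in flight, "
          f"at most {budget} bytes of heap each")
```

At '-w 16 -b 2000' that is 32,000 events in flight, so the heap headroom per event is considerably smaller than at '-w 4'.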
Filebeat generates about 5,000 events per minute into Redis, and we use Logstash to drop about 4,000 of those. They are TLS events resulting from a TCP healthcheck against a TLS endpoint; we want to keep some but drop most, so we use the Logstash drop filter set to 95% rather than dropping them all in Filebeat.
I suppose we could amend the configuration to do Filebeat -> Logstash (filtering/drop) -> Redis -> Logstash (mutate) -> ES. Please let me know if that really would be considered better. We are using Redis as a buffer / queue, so putting LS in front of it seemed to negate the point of having Redis.
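To put rough numbers on that filtering step, here is a small sketch of the arithmetic, assuming the figures above (5,000 events per minute in, about 4,000 dropped, and a 95% drop rate applied only to the healthcheck TLS events). The function name and the returned keys are made up for illustration:

```python
def healthcheck_breakdown(total_per_min, dropped_per_min, drop_fraction):
    """Estimate per-minute event counts before and after the drop step."""
    # If drop_fraction of the TLS events are dropped, the TLS total is:
    tls = dropped_per_min / drop_fraction
    return {
        "tls_events": tls,
        "tls_kept": tls - dropped_per_min,
        "other_events": total_per_min - tls,
        "forwarded": total_per_min - dropped_per_min,
    }


breakdown = healthcheck_breakdown(5000, 4000, 0.95)
for name, value in breakdown.items():
    print(f"{name}: {value:.0f} per minute")
```

Under those assumptions, roughly 4,200 of the 5,000 events per minute are healthcheck TLS events, and only about 1,000 events per minute should be going on to ES.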
I don't think we have a very complicated Logstash setup in terms of mutations / pipelines / etc.
I have Logstash set to the INFO log level, and no messages are output after the initial startup. I am at a loss as to where to turn next for troubleshooting. Any help is appreciated. I can provide more details as needed.
Logstash 6.8 / ES 6.8 / Filebeat 6.8
Checking one of the three recently restarted (and working) LS nodes via http://localhost:9600/_node/stats/pipelines?pretty=true, I see it has processed about 1,100,000 events in the past 55 minutes, so about 20K events per minute.
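The same rate can be derived from two readings of that stats endpoint taken some time apart, which makes it easier to spot the moment throughput drops to zero. This is a minimal sketch, assuming the pipeline uses the default id "main" and that the response exposes an events.out counter under pipelines.<id>; the helper names are hypothetical:

```python
import json
from urllib.request import urlopen

STATS_URL = "http://localhost:9600/_node/stats/pipelines"


def fetch_events_out(url=STATS_URL, pipeline="main"):
    """Read the cumulative 'out' event counter for one pipeline."""
    with urlopen(url, timeout=5) as resp:
        stats = json.load(resp)
    # Assumed layout: {"pipelines": {"main": {"events": {"out": N}}}}
    return stats["pipelines"][pipeline]["events"]["out"]


def events_per_minute(count_start, count_end, elapsed_seconds):
    """Average throughput between two counter readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_end - count_start) / elapsed_seconds * 60.0
```

For the figures above, events_per_minute(0, 1_100_000, 55 * 60) comes out at about 20,000. Calling fetch_events_out() twice with a sleep in between and feeding both readings into events_per_minute() gives a current rate instead of an average since startup.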