Hi,
we use filebeat (8.5.3) to scrape logs from kubernetes (1.24) nodes and send them to logstash (8.5.3). We have been doing this since filebeat 6.x and have adjusted the configs to the best of our knowledge during the migrations.
Since the update from 7.17 to 8.5.x, filebeat's memory usage keeps increasing until it hits the container memory limit and gets restarted by k8s. We've allowed up to 6GB, which should be more than enough, but that limit is reached on a regular basis.
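For reference, this is the relevant fragment of our filebeat DaemonSet (only the limit mentioned above is shown here; the rest of the manifest is omitted):
resources:
  limits:
    # the 6GB limit that filebeat keeps running into
    memory: 6Gi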
We see this behaviour on all four k8s clusters we run; they all use the same filebeat config:
filebeat.inputs:
- type: filestream
  enabled: true
  paths:
    - "/var/log/messages"
  id: "ec2-messages-log"
  tags: ["ec2-messages-log"]

filebeat.autodiscover:
  providers:
    - type: kubernetes
      scope: node
      templates:
        - condition:
            equals:
              kubernetes.container.name: example-app
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              processors:
                - add_kubernetes_metadata:
                    host: ${NODE_NAME}
                    matchers:
                      - logs_path:
                          logs_path: "/var/log/containers/"
              parsers:
                - multiline:
                    type: pattern
                    pattern: '^\[(\d{1,2}\/\d{1,2}\/\d{2,4})|(\w{3} \w{3} \d{2})|(\d{2}\/\w{3}\/\d{2,4})'
                    negate: true
                    match: after
        - condition:
            not:
              equals:
                kubernetes.container.name: example-app
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              processors:
                - add_kubernetes_metadata:
                    host: ${NODE_NAME}
                    matchers:
                      - logs_path:
                          logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash-logstash:5044"]
After deactivating parts of that config step by step, my guess is that the filebeat.autodiscover / kubernetes section is the source of the memory leak. Disabling its processors / parsers did not mitigate the issue, though.
The path pattern for the container logs should be fine and not match too broadly; it follows what the documentation advises. Some older posts about filebeat memory usage suggested that the k8s nodes might have too many small logfiles open, or rotate them too often.
Checking the log paths on the nodes, however, there are only ~30 logfiles present at any given time.
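For completeness, this is roughly what the autodiscover section looked like during one of the isolation tests mentioned above, with the processors and parsers removed (a simplified sketch, not the exact file we ran; the second template is omitted here for brevity):
# Stripped-down autodiscover config used while isolating the issue:
# same provider and template as above, but without processors/parsers.
filebeat.autodiscover:
  providers:
    - type: kubernetes
      scope: node
      templates:
        - condition:
            equals:
              kubernetes.container.name: example-app
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log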
Here are the JSON-formatted metrics logged by one filebeat pod shortly before it hits the memory limit:
{
"log.level": "info",
"@timestamp": "2022-12-28T07:57:34.547Z",
"log.logger": "monitoring",
"log.origin": {
"file.name": "log/log.go",
"file.line": 186
},
"message": "Non-zero metrics in the last 30s",
"service.name": "filebeat",
"monitoring": {
"metrics": {
"beat": {
"cgroup": {
"cpu": {
"stats": {
"periods": 299,
"throttled": {
"ns": 6014494525,
"periods": 36
}
}
},
"cpuacct": {
"total": {
"ns": 10627530809
}
},
"memory": {
"mem": {
"usage": {
"bytes": 5544570880
}
}
}
},
"cpu": {
"system": {
"ticks": 1703480,
"time": {
"ms": 860
}
},
"total": {
"ticks": 10968490,
"time": {
"ms": 9990
},
"value": 10968490
},
"user": {
"ticks": 9265010,
"time": {
"ms": 9130
}
}
},
"handles": {
"limit": {
"hard": 1048576,
"soft": 1048576
},
"open": 116
},
"info": {
"ephemeral_id": "dd93b8f2-8632-4c22-ab15-205dcb332ff4",
"uptime": {
"ms": 80580095
},
"version": "8.5.3"
},
"memstats": {
"gc_next": 6229816968,
"memory_alloc": 2781633416,
"memory_total": 678799366512,
"rss": 5580726272
},
"runtime": {
"goroutines": 223821
}
},
"filebeat": {
"events": {
"active": 1,
"added": 159,
"done": 158
},
"harvester": {
"open_files": 14,
"running": 14
}
},
"libbeat": {
"config": {
"module": {
"running": 37
}
},
"output": {
"events": {
"acked": 159,
"active": 0,
"batches": 94,
"total": 159
},
"read": {
"bytes": 564
},
"write": {
"bytes": 142303
}
},
"pipeline": {
"clients": 38,
"events": {
"active": 0,
"published": 159,
"total": 159
},
"queue": {
"acked": 159
}
}
},
"registrar": {
"states": {
"current": 51,
"update": 155
},
"writes": {
"success": 20,
"total": 20
}
},
"system": {
"load": {
"1": 0.7,
"15": 1.12,
"5": 0.99,
"norm": {
"1": 0.175,
"15": 0.28,
"5": 0.2475
}
}
}
},
"ecs.version": "1.6.0"
}
}
The overall number of open handles seems okay-ish to me. The number of goroutines (runtime.goroutines: 223821) seems way too high, though.
Is there any obvious issue with the config that I've missed? Filebeat had memory issues with this config on 7.x as well, but they vanished after a few restarts. With 8.x, they occur consistently in all k8s clusters, after every restart.
Following some older posts, I've also tried adding
queue.mem:
  events: 4096
  flush.min_events: 16
  flush.timeout: 0s
to the config, but the issue remains the same: the number of goroutines keeps rising until the pod gets OOM-killed.
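The next thing I plan to try is enabling the Beats HTTP monitoring endpoint with profiling, so I can pull goroutine dumps from the running pod. This is only a sketch based on my reading of the documented http.* settings; host and port are my own choices:
# Expose the local monitoring endpoint with Go pprof endpoints enabled
# (assumption: /debug/pprof/* becomes reachable once http.pprof.enabled is set).
http.enabled: true
http.host: localhost
http.port: 5066
http.pprof.enabled: true
# A goroutine dump could then be fetched from inside the pod, e.g.:
#   curl -s 'http://localhost:5066/debug/pprof/goroutine?debug=2' > goroutines.txt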
Any advice on how to debug this further?
Thanks in advance!
Best regards.