Hi.
We're running the following Filebeat configuration:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: REDACTED-filebeat-filebeat
spec:
  ... REDACTED
  template:
    metadata: REDACTED
    spec:
      containers:
      - args:
        - -e
        - -E
        - http.enabled=true
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: docker.elastic.co/beats/filebeat:8.12.2
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          privileged: false
          runAsUser: 0
        volumeMounts:
        - mountPath: /usr/share/filebeat/filebeat.yml
          name: filebeat-config
          readOnly: true
          subPath: filebeat.yml
        - mountPath: /usr/share/filebeat/data
          name: data
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /var/log
          name: varlog
          readOnly: true
        - mountPath: /var/run/docker.sock
          name: varrundockersock
          readOnly: true
      volumes:
      - configMap:
          name: REDACTED-filebeat-filebeat-daemonset-config
        name: filebeat-config
      - hostPath:
          path: /var/lib/REDACTED-filebeat-filebeat-logging-data
          type: DirectoryOrCreate
        name: data
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: varrundockersock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 50%
    type: RollingUpdate
With the following config:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints:
        enabled: true
        default_config:
          type: filestream
          id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
          take_over: true
          enabled: false
          paths:
            - /var/log/containers/*-${data.kubernetes.container.id}.log # CRI path
          parsers:
            - container:
                stream: all
                format: auto
          prospector:
            scanner:
              symlinks: true
      add_resource_metadata: # Considers namespace annotations for hints
        deployment: false
        cronjob: false
        namespace:
          include_annotations:
            - "nsannotations1"

processors:
  - rename:
      ignore_missing: true
      fail_on_error: false
      fields:
        - from: kubernetes.labels.REDACTED
          to: REDACTED
        - from: kubernetes.labels.REDACTED
          to: REDACTED
        - from: kubernetes.labels.REDACTED
          to: REDACTED
        - from: kubernetes.namespace_labels.REDACTED
          to: REDACTED
        - from: kubernetes.namespace_labels.REDACTED
          to: REDACTED
        - from: kubernetes.namespace_labels.REDACTED
          to: REDACTED
  - copy_fields:
      fields:
        - from: REDACTED
          to: REDACTED
      when:
        not:
          has_fields:
            - REDACTED

output:
  kafka:
    version: 2
    hosts:
      - REDACTED:9093
      - REDACTED:9093
    topic: 'REDACTED-%{[REDACTED]:REDACTED}-%{[REDACTED]:REDACTED}'
    ssl:
      certificate_authorities:
        - |
          REDACTED
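For what it's worth, we don't set any queue options, so I assume we're running the default in-memory queue. If I read the 8.12 documentation correctly, the defaults are roughly the following (these values are my assumption of the defaults, not something we set explicitly):

queue.mem:
  events: 3200
  flush.min_events: 1600
  flush.timeout: 10s

With only 2 events every 30 seconds, the queue should stay essentially empty, so I wouldn't expect it to account for the growth described below.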
On that K8s cluster, there is only one test pod using Filebeat for log collection, and it logs two messages every 30 seconds:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-filebeat
  name: test-filebeat
spec:
  selector:
    matchLabels:
      app: test-filebeat
  template:
    metadata:
      annotations:
        co.elastic.logs/enabled: "true"
      labels:
        app: test-filebeat
        REDACTED
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - "while true\ndo\n echo \"Bip bop I'm a Pod, my name is ${NAME} and I'm testing filebeat (with kafka output) log collection\" \n date\n sleep ${NAP_SECONDS}\ndone\n"
        env:
        - name: NAP_SECONDS
          value: "30"
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: alpine
Filebeat appears to steadily consume more memory, which leads me to suspect a leak somewhere.
Here is the memory graph (specifically, the metric is container_memory_working_set_bytes) for the Filebeat container that collects those logs (reminder: 2 events every 30 seconds). It is the middle line in the graph; the other lines are the other Filebeat containers, which are not collecting anything.
As you can see, at first it stays in the 200Mi to 400Mi range (which is already fairly high), but then its baseline grows to approximately 300Mi, with peaks as high as 700Mi.
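For a rough cross-check of that number outside of the monitoring stack, I can compare it against kubectl top, which should report a similar working-set figure (the namespace and label selector below are placeholders):

# working-set memory per container, as reported by metrics-server
kubectl top pod -n <filebeat-namespace> -l app=filebeat --containers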
The last 24 hours were eventful, as our control plane had some restarts due to the underlying infrastructure, which caused Filebeat to consume even more memory and get OOMKilled twice:
(I can't post two graphs at once since I'm a new user, so I'll add the second one in a comment.)
While a control plane in bad shape is unusual, I don't think it should cause Filebeat to blow up as well, especially since that means we could lose control-plane logs right when we need them most (we only use Filebeat for application logs currently, but intend to ship Kubernetes logs as well soon).
I've noted a minor inconsistency: looking at the Filebeat logs, which periodically report some metrics, I've seen memstats.memory_alloc, which is substantially lower than container_memory_working_set_bytes in my monitoring. I assume they are not measuring exactly the same thing? (My understanding is that memory_alloc only covers live Go heap allocations, while the container working set also counts stacks, runtime overhead, and active page cache.)
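Since http.enabled=true is already passed in the DaemonSet args, I can also pull the same memstats on demand from the monitoring endpoint to make the comparison easier (assuming the default port 5066 and that curl is available in the image; pod and namespace names are placeholders):

# dump the live stats; the beat.memstats values should match what is logged every 30s
kubectl exec -n <filebeat-namespace> <filebeat-pod> -- curl -s http://localhost:5066/stats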
If needed, I can enable pprof and run it for a few days to collect more data.
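My rough plan would be the sketch below, assuming http.pprof.enabled is honored alongside the existing http.enabled=true and exposes the standard Go pprof handlers on the same port (pod and namespace names are placeholders):

# 1. add these two entries to the container args in the DaemonSet:
#      - -E
#      - http.pprof.enabled=true
# 2. periodically grab a heap profile from the running pod:
kubectl exec -n <filebeat-namespace> <filebeat-pod> -- \
  curl -s http://localhost:5066/debug/pprof/heap > heap.pprof
# 3. inspect it locally:
go tool pprof heap.pprof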
If that's not a leak, any ideas on how to prevent Filebeat from growing its memory usage like this?
Thanks for reading.
Logs: I don't have logs for the Filebeat that exhibited the first behavior, as there were two restarts due to OOM. However, I didn't notice anything out of the ordinary while looking at them, mostly "Non-zero metrics" lines. The last ones from the currently running pod are:
{"log.level":"info","@timestamp":"2024-03-12T11:30:28.162Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":41,"throttled":{"ns":615030897,"periods":1}}},"cpuacct":{"total":{"ns":628925660}},"memory":{"mem":{"usage":{"bytes":331001856}}}},"cpu":{"system":{"ticks":61130,"time":{"ms":10}},"total":{"ticks":953000,"time":{"ms":330},"value":953000},"user":{"ticks":891870,"time":{"ms":320}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44670057},"version":"8.12.2"},"memstats":{"gc_next":257757528,"memory_alloc":150620368,"memory_total":331671957272,"rss":397660160},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":4},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":4,"active":0,"batches":2,"total":4}},"outputs":{"kafka":{"bytes_read":312,"bytes_write":3822}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":4}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.02,"15":0.16,"5":0.16,"norm":{"1":0.0025,"15":0.02,"5":0.02}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:30:58.162Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":34}},"cpuacct":{"total":{"ns":676148123}},"memory":{"mem":{"usage":{"bytes":344637440}}}},"cpu":{"system":{"ticks":61150,"time":{"ms":20}},"total":{"ticks":953390,"time":{"ms":390},"value":953390},"user":{"ticks":892240,"time":{"ms":370}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44700056},"version":"8.12.2"},"memstats":{"gc_next":263442976,"memory_alloc":188756712,"memory_total":331816014528,"rss":410624000},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":2,"added":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":2,"published":2,"total":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.01,"15":0.15,"5":0.14,"norm":{"1":0.0013,"15":0.0188,"5":0.0175}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:31:28.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":32,"throttled":{"ns":436049862,"periods":1}}},"cpuacct":{"total":{"ns":610060642}},"memory":{"mem":{"usage":{"bytes":314544128}}}},"cpu":{"system":{"ticks":61180,"time":{"ms":30}},"total":{"ticks":953720,"time":{"ms":330},"value":953720},"user":{"ticks":892540,"time":{"ms":300}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44730057},"version":"8.12.2"},"memstats":{"gc_next":232351944,"memory_alloc":192536696,"memory_total":331941757520,"rss":380456960},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":4},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":4,"active":0,"batches":2,"total":4}},"outputs":{"kafka":{"bytes_read":312,"bytes_write":3813}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":4}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0,"15":0.14,"5":0.13,"norm":{"1":0,"15":0.0175,"5":0.0163}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:31:58.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":40,"throttled":{"ns":721637158,"periods":2}}},"cpuacct":{"total":{"ns":761081044}},"memory":{"mem":{"usage":{"bytes":324538368}}}},"cpu":{"system":{"ticks":61190,"time":{"ms":10}},"total":{"ticks":954180,"time":{"ms":460},"value":954180},"user":{"ticks":892990,"time":{"ms":450}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44760057},"version":"8.12.2"},"memstats":{"gc_next":254664584,"memory_alloc":131239520,"memory_total":332087599968,"rss":391262208},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":2,"active":0,"batches":1,"total":2}},"outputs":{"kafka":{"bytes_read":156,"bytes_write":1912}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.16,"15":0.15,"5":0.15,"norm":{"1":0.02,"15":0.0188,"5":0.0188}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:32:28.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":32,"throttled":{"ns":531048983,"periods":1}}},"cpuacct":{"total":{"ns":685457701}},"memory":{"mem":{"usage":{"bytes":338583552}}}},"cpu":{"system":{"ticks":61220,"time":{"ms":30}},"total":{"ticks":954590,"time":{"ms":410},"value":954590},"user":{"ticks":893370,"time":{"ms":380}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44790058},"version":"8.12.2"},"memstats":{"gc_next":275733944,"memory_alloc":169929744,"memory_total":332231794168,"rss":405037056},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":2,"added":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":2,"published":2,"total":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.1,"15":0.14,"5":0.13,"norm":{"1":0.0125,"15":0.0175,"5":0.0163}}}},"ecs.version":"1.6.0"}}
I do have some error logs related to the filestream ID possibly being duplicated, but I don't see any duplicates in the output.
{"log.level":"error","@timestamp":"2024-03-12T10:40:44.314Z","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*InputManager).Create","file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'kubernetes-container-logs-test-filebeat-55795d7b56-f6qvk-445a87fc969aa48e3bbc6dd5e60c318f2265f768c7f2ef8ef9c067df81685228' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
And finally I got a trace, not sure where this one came from:
2024-03-12T06:44:22.493204589+01:00 Trace[445770726]: ---"Objects listed" error:<nil> 45067ms (05:44:22.492)