I have observed that when Filebeat is shipping container logs that grow rapidly, it appears to hit some threshold and then simply stops shipping events from that log file. I initially suspected log rotation, but the behaviour persists even with log rotation disabled.
My test setup consists of a "log driver" (a small utility of ours that just generates log lines; a rough sketch of the kind of output it produces is included below the test parameters) and Filebeat configured to autodiscover container logs and ship them to Elasticsearch. Everything runs on Azure Kubernetes Service, so all workloads (Filebeat, Elasticsearch, and the log driver) are containerized. The log driver's output rate is deliberately set higher than Filebeat can sustain, so Filebeat falls behind.
Here I'll share the setup and outcome of a single test session:
Filebeat version: 7.9.1
Elasticsearch version: 7.5.2
Log driver log generation rate: 2M events at 10k events per second
Max container log file size: 500MB
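For illustration, here is a minimal stand-in for the log driver (this is not the real utility, which throttles itself to ~10,000 events/s; the sketch just shows the shape of the output, where every line begins with "[" so the multiline pattern in filebeat.yml below treats each line as the start of a new event):
# Hypothetical stand-in for the log driver: emit 2,000,000 single-line events
# to stdout; the container runtime captures stdout into the pod log file.
# (The real log driver also rate-limits itself to ~10,000 events per second.)
seq 1 2000000 | awk '{ printf "[2021-09-28T20:11:31.000Z] INFO logdriver event %d some-padding-payload\n", $1 }'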
# List container log file (check inode + size)
root@aks-logdriver1-38647758-vmss000000:~# ls -ail /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/*
4391497 -rw-r----- 1 root root 396089742 Sep 28 20:14 /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log
# Confirm 2M events were generated and written to log file
root@aks-logdriver1-38647758-vmss000000:~# wc -l /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log
2000000 /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log
Events received by Elasticsearch: 1,302,768
(No further events were received by ES after this point; roughly 697,000 events, about 35% of the 2M generated, never arrived.)
Checking Filebeat's registry shows 665 entries for the log driver's container log file (too long to post in full, but here's a snippet):
{"k":"filebeat::logs::native::4391497-2049","v":{"id":"native::4391497-2049","prev_id":"","offset":258007620,"FileStateOS":{"inode":4391497,"device":2049},"source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","timestamp":[792492084,1632860154],"ttl":-1,"type":"container","identifier_name":"native"}}
{"k":"filebeat::logs::native::4391497-2049","v":{"FileStateOS":{"device":2049,"inode":4391497},"id":"native::4391497-2049","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","ttl":-1,"offset":258007620,"timestamp":[792492084,1632860154],"type":"container","identifier_name":"native"}}
{"k":"filebeat::logs::native::4391497-2049","v":{"id":"native::4391497-2049","timestamp":[792492084,1632860154],"type":"container","FileStateOS":{"inode":4391497,"device":2049},"identifier_name":"native","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","offset":258007620,"ttl":-1}}
{"k":"filebeat::logs::native::4391497-2049","v":{"timestamp":[792492084,1632860154],"ttl":-1,"FileStateOS":{"inode":4391497,"device":2049},"identifier_name":"native","id":"native::4391497-2049","offset":258007620,"type":"container","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log"}}
It's not clear to me why there are so many registry entries for the same file, including many duplicates at each offset. The largest offset recorded is 258007620 (roughly 258 MB), far short of the 396089742 bytes (roughly 396 MB) in the container log file.
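For reference, this is roughly how I pulled the entry count and the largest offset out of the registry. The path is what I believe is the default data directory for a containerized Filebeat 7.9; treat the exact location as an assumption and adjust if yours differs:
# Count registry operations for the logdriver file and find the largest offset
# recorded for it. REG points at Filebeat's append-only registry log; the path
# below assumes the default data directory inside the Filebeat container.
REG=/usr/share/filebeat/data/registry/filebeat/log.json
jq -r 'select(.k == "filebeat::logs::native::4391497-2049" and .v != null) | .v.offset' "$REG" | wc -l
jq -r 'select(.k == "filebeat::logs::native::4391497-2049" and .v != null) | .v.offset' "$REG" | sort -n | tail -1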
Filebeat logs:
2021-09-28T20:11:31.829Z INFO log/input.go:157 Configured paths: [/var/log/containers/*-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log]
2021-09-28T20:11:31.829Z INFO log/input.go:157 Configured paths: [/var/log/containers/*-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log]
2021-09-28T20:11:31.831Z INFO log/harvester.go:297 Harvester started for file: /var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log
2021-09-28T20:11:36.497Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":758550,"time":{"ms":79}},"total":{"ticks":2081300,"time":{"ms":1698},"value":2081300},"user":{"ticks":1322750,"time":{"ms":1619}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":17},"info":{"ephemeral_id":"c4768a0b-30ac-41c4-8194-7fdd83803d1b","uptime":{"ms":1214490095}},"memstats":{"gc_next":84897472,"memory_alloc":76794560,"memory_total":176682582872,"rss":38346752},"runtime":{"goroutines":159}},"filebeat":{"events":{"active":3946,"added":22888,"done":18942},"harvester":{"files":{"382da571-8b04-4ebd-a2ff-544f7c71feb6":{"last_event_published_time":"2021-09-28T20:11:33.768Z","last_event_timestamp":"2021-09-28T20:11:31.881Z","read_offset":6346,"size":30289},"c93dd7b4-6d01-4988-bf13-f1bba036321b":{"last_event_published_time":"2021-09-28T20:11:33.742Z","last_event_timestamp":"2021-09-28T20:11:31.834Z","read_offset":2279,"size":1397},"ebd9e815-c1a3-453b-9ae1-7fb4b8dc7fef":{"last_event_published_time":"2021-09-28T20:11:36.097Z","last_event_timestamp":"2021-09-28T20:11:33.919Z","name":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","read_offset":4522961,"size":295814,"start_time":"2021-09-28T20:11:31.830Z"}},"open_files":3,"running":3,"started":1}},"libbeat":{"config":{"module":{"running":8,"starts":1}},"output":{"events":{"acked":18941,"active":4096,"batches":15,"total":23037}},"pipeline":{"clients":8,"events":{"active":4117,"filtered":1,"published":22886,"total":22888},"queue":{"acked":18941}}},"registrar":{"states":{"current":11,"update":18942},"writes":{"success":16,"total":16}},"system":{"load":{"1":0.16,"15":0.01,"5":0.03,"norm":{"1":0.08,"15":0.005,"5":0.015}}}}}}
The harvester is started, but it is never inactivated or stopped until the log driver container is removed, and the log driver's log file is not mentioned again in Filebeat's logs after this point. If I'm reading the 30-second metrics snapshot correctly, the output acked roughly 18,900 events in that window, i.e. on the order of 600-650 events per second, which is consistent with Filebeat lagging far behind the 10k events/s being generated.
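For completeness, this is roughly how I checked that the file never shows up again in Filebeat's own log; the pod name and namespace below are placeholders, not our actual values:
# Hypothetical check (pod name and namespace are placeholders): list every
# Filebeat log line that mentions the logdriver container ID, excluding the
# periodic 30s metrics snapshots.
kubectl logs -n logging filebeat-xxxxx \
  | grep 4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74 \
  | grep -v 'Non-zero metrics'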
Filebeat configuration (filebeat.yml):
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.labels.core-filebeat-logging: "true"
          config:
            - type: container
              multiline:
                pattern: '^\['
                negate: true
                match: after
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
        - condition:
            not:
              equals:
                kubernetes.labels.core-filebeat-logging: "true"
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
processors:
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
setup.template.settings:
setup.ilm.enabled: false
setup.template.enabled: false
setup.kibana:
  host: "kibana:443"
  protocol: "https"
output.elasticsearch:
  hosts: ["elastic:443"]
  protocol: "https"
  pipeline: all_logs
  username: elastic
  password: "${ES_PASS}"
  bulk_max_size: 3200
  worker: 2
  ttl: 100
  pipelining: 0
  template:
    name: "filebeat"
    path: "fields.yml"
    overwrite: true
monitoring:
  enabled: true
  elasticsearch:
    hosts: ["elastic:443"]
    username: elastic
    password: "${ES_PASS}"
Filebeat does not report any dropped events in its monitoring metrics. We are also monitoring Elasticsearch for bulk rejections as a sign of load on ES, and over the duration of this test ES shows no bulk rejections.
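For reference, the bulk-rejection check boils down to the rejected counter of the write thread pool; the host and credentials below just mirror the output section of filebeat.yml and are a sketch of the query, not our actual dashboard:
# Sketch of the bulk-rejection check: the "write" thread pool handles bulk
# indexing, so a growing "rejected" count would indicate back-pressure from ES.
curl -s -u "elastic:${ES_PASS}" \
  "https://elastic:443/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"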
Any insight into this problem would be greatly appreciated.