Issue:
We're seeing occasional data loss in kibana. For a given log file, filebeat will grab and successfully send several lines to elasticsearch, and then suddenly (and before the whole file has been processed) nothing more arrives, losing any number of lines from the end of the file.
Current stack:
I'm running filebeat as a daemonset on my kubernetes cluster to grab logs from all the pods running on that cluster. Here's the values file I'm using, which includes some info about the helm command I run to install filebeat, as well as which chart I'm pulling in. Note that values surrounded with {{}} are templatized values that are injected dynamically by an external controller before we make the helm call. We don't seem to have any issues starting up filebeat and sending logs in the general case, so I'm fairly confident the issue isn't in our find/replace code.
#############################################
# Filebeat Helm Chart Values
# Usage:
# helm upgrade --install filebeat --namespace kube-system -f filebeat_values.yaml stable/filebeat
# Reference:
# https://github.com/helm/charts/tree/master/stable/filebeat
image:
  tag: 7.6.0
overrideConfig:
  filebeat.config:
    modules:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: true
  filebeat.autodiscover:
    providers:
      - type: kubernetes
        hints.enabled: true
        templates:
          - condition:
              equals:
                kubernetes.namespace: "*"
            config:
              - type: container
                paths:
                  - /var/log/containers/*-${data.kubernetes.container.id}.log
  processors:
    - add_kubernetes_metadata:
        in_cluster: true
    - drop_event:
        when:
          or:
            - contains:
                kubernetes.namespace: "kube"
            - contains:
                kubernetes.namespace: "default"
    - add_cloud_metadata:
    - add_fields:
        target: kubernetes
        fields:
          cluster_name: {{cluster_name}}
  output:
    elasticsearch:
      hosts: ["{{es_host}}"]
      bulk_max_size: 20
      protocol: https
      enabled: true
      username: elastic
      password: {{password}}
      index: "logs-%{[kubernetes.pod.labels.service-name]:filebeat}-%{+xxxx.ww}"
    file:
      enabled: false
  setup.template.name: "filebeat"
  setup.template.pattern: "logs-*"
  setup.ilm.enabled: false
monitoring:
  enabled: false
  serviceMonitor:
    enabled: false
tolerations:
  - effect: NoExecute
    key: dedicated
    operator: Equal
    value: nginx
Logs:
So digging through the logs around the time we started and then stopped seeing lines from the specific log file, I've only noticed one potentially interesting thing: the file in question appears to have had two harvesters started for it, about 10 seconds apart (with no other log lines in between). The first and last lines I included don't seem to be relevant (but maybe I'm wrong about that). Below the excerpt I've also sketched the debug-logging change I'm planning to make so there's more detail the next time this happens.
2020-10-12T12:45:21.148Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":2575380,"time":{"ms":29}},"total":{"ticks":6474720,"time":{"ms":91},"value":6474720},"user":{"ticks":3899340,"time":{"ms":62}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":17},"info":{"ephemeral_id":"3bdf4371-acb1-4e2f-9e7c-1913bf4e4950","uptime":{"ms":919470244}},"memstats":{"gc_next":15899280,"memory_alloc":13588952,"memory_total":428778194016},"runtime":{"goroutines":136}},"filebeat":{"events":{"active":-22,"added":114,"done":136},"harvester":{"files":{"03138bf6-1116-485f-ba00-39f574fb4dae":{"last_event_published_time":"2020-10-12T12:45:20.717Z","last_event_timestamp":"2020-10-12T12:45:19.208Z","read_offset":978,"size":978},"45cbeab2-fef2-4ecc-be21-9b2d7d79d7d2":{"last_event_published_time":"2020-10-12T12:44:57.230Z","last_event_timestamp":"2020-10-12T12:44:54.213Z","read_offset":3837,"size":7681},"7d196083-44e5-4479-968d-5a446beafa7f":{"last_event_published_time":"2020-10-12T12:45:18.639Z","last_event_timestamp":"2020-10-12T12:45:17.075Z","read_offset":7510,"size":7214},"83c285ef-02fc-40eb-997d-aba08275d109":{"last_event_published_time":"2020-10-12T12:44:56.250Z","last_event_timestamp":"2020-10-12T12:44:51.149Z","read_offset":2696,"size":2696},"99387d61-1535-42f3-a092-b23e36bf5655":{"last_event_published_time":"2020-10-12T12:45:20.686Z","last_event_timestamp":"2020-10-12T12:45:19.984Z","read_offset":7370,"size":7665},"af35bf82-8573-4d30-930b-fd993d6df3fe":{"last_event_published_time":"2020-10-12T12:45:11.219Z","last_event_timestamp":"2020-10-12T12:45:04.090Z","read_offset":7684,"size":11551},"c631e75f-b240-4e9f-9645-5bdee1fbfdfd":{"last_event_published_time":"2020-10-12T12:45:17.821Z","last_event_timestamp":"2020-10-12T12:45:10.962Z","read_offset":1179,"size":1644}},"open_files":7,"running":7}},"libbeat":{"config":{"module":{"running":0},"reloads":3},"output":{"events":{"acked":134,"batches":19,"total":134},"read":{"bytes":10683},"write":{"bytes":225710}},"pipeline":{"clients":10,"events":{"active":4,"filtered":2,"published":112,"total":114},"queue":{"acked":134}}},"registrar":{"states":{"current":70,"update":136},"writes":{"success":21,"total":21}},"system":{"load":{"1":0.05,"15":0.13,"5":0.09,"norm":{"1":0.0125,"15":0.0325,"5":0.0225}}}}}}
2020-10-12T12:45:30.668Z INFO log/input.go:152 Configured paths: [/var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/*-json.log]
2020-10-12T12:45:30.668Z INFO input/input.go:114 Starting input of type: container; ID: 10484647692728599411
2020-10-12T12:45:30.668Z INFO log/harvester.go:297 Harvester started for file: /var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074-json.log
2020-10-12T12:45:40.682Z INFO log/harvester.go:297 Harvester started for file: /var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074-json.log
2020-10-12T12:45:43.783Z INFO input/input.go:149 input ticker stopped
2020-10-12T12:45:43.783Z INFO input/input.go:167 Stopping Input: 10484647692728599411
2020-10-12T12:45:51.149Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":2575850,"time":{"ms":468}},"total":{"ticks":6477450,"time":{"ms":2735},"value":6477450},"user":{"ticks":3901600,"time":{"ms":2267}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":19},"info":{"ephemeral_id":"3bdf4371-acb1-4e2f-9e7c-1913bf4e4950","uptime":{"ms":919500245}},"memstats":{"gc_next":36009552,"memory_alloc":28497544,"memory_total":429033812184,"rss":3411968},"runtime":{"goroutines":158}},"filebeat":{"events":{"active":35,"added":13058,"done":13023},"harvester":{"closed":2,"files":{"03138bf6-1116-485f-ba00-39f574fb4dae":{"last_event_published_time":"2020-10-12T12:45:50.782Z","last_event_timestamp":"2020-10-12T12:45:49.208Z","read_offset":975,"size":976},"224b2748-ff8d-4f10-858f-7beca4fc5243":{"last_event_published_time":"","last_event_timestamp":"","name":"/var/lib/docker/containers/cfb93a48a3cb980627b5720d4d14a8d89477bfc1e918f5573f7f80c04729c34b/cfb93a48a3cb980627b5720d4d14a8d89477bfc1e918f5573f7f80c04729c34b-json.log","start_time":"2020-10-12T12:45:50.180Z"},"45cbeab2-fef2-4ecc-be21-9b2d7d79d7d2":{"last_event_published_time":"2020-10-12T12:45:50.236Z","last_event_timestamp":"2020-10-12T12:45:40.831Z","read_offset":11550,"size":7704},"7d196083-44e5-4479-968d-5a446beafa7f":{"last_event_published_time":"2020-10-12T12:45:45.645Z","last_event_timestamp":"2020-10-12T12:45:45.497Z","read_offset":7507,"size":7803},"83c285ef-02fc-40eb-997d-aba08275d109":{"last_event_published_time":"2020-10-12T12:45:44.255Z","last_event_timestamp":"2020-10-12T12:45:43.785Z","read_offset":4140,"size":3484},"99387d61-1535-42f3-a092-b23e36bf5655":{"last_event_published_time":"2020-10-12T12:45:50.777Z","last_event_timestamp":"2020-10-12T12:45:49.984Z","read_offset":7319,"size":7074},"af35bf82-8573-4d30-930b-fd993d6df3fe":{"last_event_published_time":"2020-10-12T12:45:36.407Z","last_event_timestamp":"2020-10-12T12:45:34.092Z","read_offset":3834,"size":3834},"c631e75f-b240-4e9f-9645-5bdee1fbfdfd":{"last_event_published_time":"2020-10-12T12:45:42.822Z","last_event_timestamp":"2020-10-12T12:45:40.962Z","read_offset":1692,"size":1179},"dc12996d-2233-4ae0-84a6-4f5393a6cc08":{"last_event_published_time":"2020-10-12T12:45:50.182Z","last_event_timestamp":"2020-10-12T12:45:50.043Z","name":"/var/lib/docker/containers/b5a505b2c7d387d86053be6bffb4230b87e960d44beaadf6970f2f4373d0238f/b5a505b2c7d387d86053be6bffb4230b87e960d44beaadf6970f2f4373d0238f-json.log","read_offset":2594,"size":2594,"start_time":"2020-10-12T12:45:50.180Z"}},"open_files":9,"running":9,"started":4}},"libbeat":{"config":{"module":{"running":0},"reloads":3},"output":{"events":{"acked":13009,"batches":657,"total":13009},"read":{"bytes":439677,"errors":7},"write":{"bytes":22211244}},"pipeline":{"clients":12,"events":{"active":39,"filtered":14,"published":13044,"total":13058},"queue":{"acked":13009}}},"registrar":{"states":{"current":74,"update":13023},"writes":{"success":668,"total":668}},"system":{"load":{"1":0.43,"15":0.16,"5":0.18,"norm":{"1":0.1075,"15":0.04,"5":0.045}}}}}}
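For what it's worth, the next time this reproduces I'd like the harvester lifecycle to be more visible, so I'm planning to turn on debug logging for just the relevant components. A minimal sketch of what I'd add to overrideConfig (I believe harvester and input are valid logging selectors in 7.6; if they aren't, "*" enables debug output for everything):

logging.level: debug
# Restrict debug output to the components involved in starting and stopping
# harvesters, to keep the log volume manageable.
logging.selectors: ["harvester", "input"]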
Issues ruled out:
Network connectivity - While this could still be the problem, we've done quite a bit to make the network path between filebeat and elasticsearch as stable as possible. We aren't seeing any of the (previously seen) indicators of networking issues like connection resets or HTTP 413 body-too-large errors, but it's always a possibility, I guess. Any ideas on where (or how) to look further would be appreciated!
Log file getting deleted before filebeat finishes - In the filebeat logs above, input ID 10484647692728599411 started, opened two harvesters on the same file, and then closed. From open to close it took 13 seconds and some change. We keep our pods around for a minimum of 10 minutes, and even if we were deleting the file I'd expect filebeat to still read the file via the file descriptor anyway. We know we're opening the file, as we see the first N lines, so barring some weirdly consistent file descriptor issue, I think we can safely assume the logs aren't disappearing before filebeat can get to them.
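To rule this out completely, I'm considering pinning the close/clean behavior explicitly on the container input inside the existing autodiscover template. A minimal sketch of the input config (close_removed and clean_removed are standard log-input options and the container input is built on the log input, though I haven't verified they behave identically under autodiscover):

config:
  - type: container
    paths:
      - /var/log/containers/*-${data.kubernetes.container.id}.log
    # Keep reading via the already-open file descriptor even if the file is
    # deleted, and don't drop its registry state when the file disappears.
    close_removed: false
    clean_removed: false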
Hypothesis:
I'm thinking some bug is causing filebeat to open two harvesters for the same file, and that the two harvesters racing each other causes the input for that file to be terminated before all of its lines have been successfully sent to elasticsearch. Unfortunately I don't know nearly enough about filebeat inputs and harvesters to state that with certainty (or to investigate it myself).
So far the only reports I've had for this issue have been from pods run as kubernetes jobs. I can't say with confidence that that's the only place we're having this issue, only that it's the only place folks are currently noticing missing logs. It might be a worthwhile starting point though.
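Since jobs are the common factor, one thing I'm tempted to try is giving autodiscover more slack before it tears down the input for a pod that has finished. If I'm reading the docs right, the kubernetes provider has a cleanup_timeout setting (60s by default) that controls how long it waits after a pod disappears before stopping that pod's configuration. A sketch of the change, with the 10m value picked arbitrarily:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
      # Wait longer after a pod goes away before stopping its input, so
      # short-lived job pods have time to get their final lines shipped
      # (10m is a guess; the default is 60s).
      cleanup_timeout: 10m
      templates:
        # ... existing template unchanged ...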
Is this sounding like a known issue that has a known workaround? A bug that's been fixed in a more recent version of filebeat? Is this my just deserts for using kubernetes?