Filebeat opening two harvesters for the same file

Issue:
We're seeing occasional data loss in Kibana. Filebeat picks up a given log file and successfully sends the first several lines to Elasticsearch, and then suddenly, before the whole file has been processed, nothing more arrives; we lose any number of lines from the end of the file.

Current stack:
I'm running Filebeat as a DaemonSet on my Kubernetes cluster to collect logs from all the pods running on that cluster. Here's the values file I'm using; the comments at the top show the helm command I run to install Filebeat and which chart I'm pulling in. Note that values wrapped in {{}} are templated placeholders that an external controller injects dynamically before we make the helm call. Filebeat starts up and ships logs without any issues in the general case, so I'm fairly confident the problem isn't in our find/replace code.

#############################################
# Filebeat Helm Chart Values
# Usage:
#   helm upgrade --install filebeat --namespace kube-system -f filebeat_values.yaml stable/filebeat
# Reference:
#   https://github.com/helm/charts/tree/master/stable/filebeat
image:
  tag: 7.6.0

overrideConfig:
  filebeat.config:
    modules:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: true

  filebeat.autodiscover:
    providers:
    - type: kubernetes
      hints.enabled: true
      templates:
      - condition:
          equals:
            kubernetes.namespace: "*"
        config:
          - type: container
            paths:
            - /var/log/containers/*-${data.kubernetes.container.id}.log
  processors:
  - add_kubernetes_metadata:
      in_cluster: true
  - drop_event:
      when:
        or:
          - contains:
              kubernetes.namespace: "kube"
          - contains:
              kubernetes.namespace: "default"
  - add_cloud_metadata:
  - add_fields:
      target: kubernetes
      fields:
        cluster_name: {{cluster_name}}
  output:
    elasticsearch:
      hosts: ["{{es_host}}"]
      bulk_max_size: 20
      protocol: https
      enabled: true
      username: elastic
      password: {{password}}
      index: "logs-%{[kubernetes.pod.labels.service-name]:filebeat}-%{+xxxx.ww}"
    file:
      enabled: false

  setup.template.name: "filebeat"
  setup.template.pattern: "logs-*"
  setup.ilm.enabled: false

monitoring:
  enabled: false
  serviceMonitor:
    enabled: false

tolerations:
- effect: NoExecute
  key: dedicated
  operator: Equal
  value: nginx

Logs:
Digging through the Filebeat logs around the time we started and then stopped seeing lines from the specific log file, I've only noticed one potentially interesting thing: two harvesters were started for the same file, about ten seconds apart, with no other log lines in between. The first and last lines I included (the monitoring metrics) don't seem to be relevant, but maybe I'm wrong about that.

2020-10-12T12:45:21.148Z        INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":2575380,"time":{"ms":29}},"total":{"ticks":6474720,"time":{"ms":91},"value":6474720},"user":{"ticks":3899340,"time":{"ms":62}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":17},"info":{"ephemeral_id":"3bdf4371-acb1-4e2f-9e7c-1913bf4e4950","uptime":{"ms":919470244}},"memstats":{"gc_next":15899280,"memory_alloc":13588952,"memory_total":428778194016},"runtime":{"goroutines":136}},"filebeat":{"events":{"active":-22,"added":114,"done":136},"harvester":{"files":{"03138bf6-1116-485f-ba00-39f574fb4dae":{"last_event_published_time":"2020-10-12T12:45:20.717Z","last_event_timestamp":"2020-10-12T12:45:19.208Z","read_offset":978,"size":978},"45cbeab2-fef2-4ecc-be21-9b2d7d79d7d2":{"last_event_published_time":"2020-10-12T12:44:57.230Z","last_event_timestamp":"2020-10-12T12:44:54.213Z","read_offset":3837,"size":7681},"7d196083-44e5-4479-968d-5a446beafa7f":{"last_event_published_time":"2020-10-12T12:45:18.639Z","last_event_timestamp":"2020-10-12T12:45:17.075Z","read_offset":7510,"size":7214},"83c285ef-02fc-40eb-997d-aba08275d109":{"last_event_published_time":"2020-10-12T12:44:56.250Z","last_event_timestamp":"2020-10-12T12:44:51.149Z","read_offset":2696,"size":2696},"99387d61-1535-42f3-a092-b23e36bf5655":{"last_event_published_time":"2020-10-12T12:45:20.686Z","last_event_timestamp":"2020-10-12T12:45:19.984Z","read_offset":7370,"size":7665},"af35bf82-8573-4d30-930b-fd993d6df3fe":{"last_event_published_time":"2020-10-12T12:45:11.219Z","last_event_timestamp":"2020-10-12T12:45:04.090Z","read_offset":7684,"size":11551},"c631e75f-b240-4e9f-9645-5bdee1fbfdfd":{"last_event_published_time":"2020-10-12T12:45:17.821Z","last_event_timestamp":"2020-10-12T12:45:10.962Z","read_offset":1179,"size":1644}},"open_files":7,"running":7}},"libbeat":{"config":{"module":{"running":0},"reloads":3},"output":{"events":{"acked":134,"batches":19,"total":134},"read":{"bytes":10683},"write":{"bytes":225710}},"pipeline":{"clients":10,"events":{"active":4,"filtered":2,"published":112,"total":114},"queue":{"acked":134}}},"registrar":{"states":{"current":70,"update":136},"writes":{"success":21,"total":21}},"system":{"load":{"1":0.05,"15":0.13,"5":0.09,"norm":{"1":0.0125,"15":0.0325,"5":0.0225}}}}}}
2020-10-12T12:45:30.668Z        INFO    log/input.go:152        Configured paths: [/var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/*-json.log]
2020-10-12T12:45:30.668Z        INFO    input/input.go:114      Starting input of type: container; ID: 10484647692728599411
2020-10-12T12:45:30.668Z        INFO    log/harvester.go:297    Harvester started for file: /var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074-json.log
2020-10-12T12:45:40.682Z        INFO    log/harvester.go:297    Harvester started for file: /var/lib/docker/containers/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074/afec019985abacfd24abe9a487ea067e65c2af0f21b8f0d7520d109566bea074-json.log
2020-10-12T12:45:43.783Z        INFO    input/input.go:149      input ticker stopped
2020-10-12T12:45:43.783Z        INFO    input/input.go:167      Stopping Input: 10484647692728599411
2020-10-12T12:45:51.149Z        INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":2575850,"time":{"ms":468}},"total":{"ticks":6477450,"time":{"ms":2735},"value":6477450},"user":{"ticks":3901600,"time":{"ms":2267}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":19},"info":{"ephemeral_id":"3bdf4371-acb1-4e2f-9e7c-1913bf4e4950","uptime":{"ms":919500245}},"memstats":{"gc_next":36009552,"memory_alloc":28497544,"memory_total":429033812184,"rss":3411968},"runtime":{"goroutines":158}},"filebeat":{"events":{"active":35,"added":13058,"done":13023},"harvester":{"closed":2,"files":{"03138bf6-1116-485f-ba00-39f574fb4dae":{"last_event_published_time":"2020-10-12T12:45:50.782Z","last_event_timestamp":"2020-10-12T12:45:49.208Z","read_offset":975,"size":976},"224b2748-ff8d-4f10-858f-7beca4fc5243":{"last_event_published_time":"","last_event_timestamp":"","name":"/var/lib/docker/containers/cfb93a48a3cb980627b5720d4d14a8d89477bfc1e918f5573f7f80c04729c34b/cfb93a48a3cb980627b5720d4d14a8d89477bfc1e918f5573f7f80c04729c34b-json.log","start_time":"2020-10-12T12:45:50.180Z"},"45cbeab2-fef2-4ecc-be21-9b2d7d79d7d2":{"last_event_published_time":"2020-10-12T12:45:50.236Z","last_event_timestamp":"2020-10-12T12:45:40.831Z","read_offset":11550,"size":7704},"7d196083-44e5-4479-968d-5a446beafa7f":{"last_event_published_time":"2020-10-12T12:45:45.645Z","last_event_timestamp":"2020-10-12T12:45:45.497Z","read_offset":7507,"size":7803},"83c285ef-02fc-40eb-997d-aba08275d109":{"last_event_published_time":"2020-10-12T12:45:44.255Z","last_event_timestamp":"2020-10-12T12:45:43.785Z","read_offset":4140,"size":3484},"99387d61-1535-42f3-a092-b23e36bf5655":{"last_event_published_time":"2020-10-12T12:45:50.777Z","last_event_timestamp":"2020-10-12T12:45:49.984Z","read_offset":7319,"size":7074},"af35bf82-8573-4d30-930b-fd993d6df3fe":{"last_event_published_time":"2020-10-12T12:45:36.407Z","last_event_timestamp":"2020-10-12T12:45:34.092Z","read_offset":3834,"size":3834},"c631e75f-b240-4e9f-9645-5bdee1fbfdfd":{"last_event_published_time":"2020-10-12T12:45:42.822Z","last_event_timestamp":"2020-10-12T12:45:40.962Z","read_offset":1692,"size":1179},"dc12996d-2233-4ae0-84a6-4f5393a6cc08":{"last_event_published_time":"2020-10-12T12:45:50.182Z","last_event_timestamp":"2020-10-12T12:45:50.043Z","name":"/var/lib/docker/containers/b5a505b2c7d387d86053be6bffb4230b87e960d44beaadf6970f2f4373d0238f/b5a505b2c7d387d86053be6bffb4230b87e960d44beaadf6970f2f4373d0238f-json.log","read_offset":2594,"size":2594,"start_time":"2020-10-12T12:45:50.180Z"}},"open_files":9,"running":9,"started":4}},"libbeat":{"config":{"module":{"running":0},"reloads":3},"output":{"events":{"acked":13009,"batches":657,"total":13009},"read":{"bytes":439677,"errors":7},"write":{"bytes":22211244}},"pipeline":{"clients":12,"events":{"active":39,"filtered":14,"published":13044,"total":13058},"queue":{"acked":13009}}},"registrar":{"states":{"current":74,"update":13023},"writes":{"success":668,"total":668}},"system":{"load":{"1":0.43,"15":0.16,"5":0.18,"norm":{"1":0.1075,"15":0.04,"5":0.045}}}}}}

Issues ruled out:
Network connectivity - While this could still be the problem, we've done quite a bit of work to make the network path to Elasticsearch as stable as possible. We aren't seeing any of the indicators of networking issues we've hit before, like connection resets or HTTP 413 (request body too large) errors, but it's always a possibility, I guess. Any ideas on where (or how) to look further would be appreciated!

Log file getting deleted before Filebeat finishes - In the Filebeat logs above, input ID 10484647692728599411 started, opened two harvesters on the same file, and then stopped; from start to stop took a little over 13 seconds. We keep our pods around for a minimum of 10 minutes, and even if we were deleting the file I'd expect Filebeat to keep reading it via the already-open file descriptor. We know the file is being opened, since we see the first N lines, so barring some weirdly consistent file descriptor issue, I think we can safely assume the logs aren't disappearing before Filebeat can get to them.
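
For context on that file-descriptor point: the container input accepts the same harvester close/clean options as the log input, so if file removal were the culprit, the knobs would live on the container config inside the autodiscover template, roughly like this (a sketch with illustrative values, not what we currently run):

# Sketch only - harvester lifecycle options added to the existing
# container config in the autodiscover template. Values are
# illustrative, not our current settings.
- type: container
  paths:
  - /var/log/containers/*-${data.kubernetes.container.id}.log
  close_removed: false   # don't close the harvester when the file is deleted
  clean_removed: false   # keep registry state for files that disappear
  close_inactive: 5m     # how long an idle harvester stays open before closing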

Hypothesis:
I suspect some bug is causing Filebeat to open two harvesters for the same file, and that with two harvesters reading the same file we hit a race condition that terminates the input for that file before all of its lines have been successfully sent to Elasticsearch. Unfortunately, I don't know nearly enough about Filebeat inputs and harvesters to state that with any certainty (or to investigate it myself).

So far the only reports of this issue have come from pods run as Kubernetes Jobs. I can't say with confidence that that's the only place it's happening, only that it's the only place folks are currently noticing missing logs. It seems like a worthwhile starting point, though.
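
As a starting point, the next thing I plan to try on a Job that exhibits this is debug logging scoped to the input/harvester code paths, so the duplicate "Harvester started" lines come with more context. Something like the following under overrideConfig (a sketch; the selector names are my best reading of the docs, so treat them as an assumption):

# Sketch only - added alongside the rest of filebeat.yml under
# overrideConfig. Selector names are assumptions on my part.
logging.level: debug
logging.selectors: ["input", "harvester", "registrar"]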

Does this sound like a known issue with a known workaround? A bug that's been fixed in a more recent version of Filebeat? Or is this my just deserts for using Kubernetes?
