Filebeat reliably dropping lines from head of first new files it sees during load testing

We are running a logging load test on our filebeat configuration and have observed that it is reliably dropping a small percentage of lines from the head of the first files it is seeing.

Our Setup

Filebeat is deployed to a 3 node GKE cluster as a DaemonSet. GKE is rotating our files every 10MB.

Our log load test is setup as follows (config at the end of the post):

  • Deployed as a single Kubernetes job run on each node for a total of 3 jobs.
  • Each job has 10 containers, which creates 10 log files for filebeat to tail on each node.
  • Each container sleeps randomly between 0-10 seconds, after which it starts logging at a rate of 550 lines/second until it reaches 99k lines.
  • This results in an overall expected ingestion rate for a single filebeat of 5.5k lines/s, and a total of 990,000 lines ingested per beat.
  • The expected ingestion total for the load test is 2,970,000 lines.

We have observed that a reliably small number of log lines have failed to be ingested (somewhere in the neighborhood of 0.004% or 80-200 lines). After some parsing of the resulting data, we were able to determine that dropped lines were coming from the head of the initial file created by the load test. (TLDR, we figured this out by inserting a total=X string in our log message and setting an expectation that there would be 30 total=1 lines, 30 total=2 lines all the way up to 30 total=99000 lines. Again, 30 containers emitting 99k lines).

It is also interesting that, after that initial drop, filebeat did not miss a single line for the remainder of the test.

I really can't believe that the events are being dropped from the head of the first file generated by the load test (I expected it would be at the tail of last file), but here we are.

Has anybody else experience this before and been able to track down a solution?


Our filebeat config file:

logging:
  level: info
  to_syslog: true

http:
  enabled: true

queue.mem:
  events: 90000 # flush.events * 15s
  flush.timeout: 5s
  flush.min_events: 6000 # bulk_max_size * workers

processors:
  - add_cloud_metadata:
      overwrite: true

filebeat.inputs:
  - type: docker
    enabled: true
    containers.ids:
      - '*'
    exclude_files:
      - 'filebeat'
      - 'POD'
      - 'kube-system'

    scan_frequency: 2s   # how often to scan for new files
    close_inactive: 90s  # close harvesters after file inactivity
    close_removed: true  # https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#filebeat-input-log-close-removed
    clean_removed: true  # https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#filebeat-input-log-clean-removed
    harvester_limit: 80  # File handler limit

    processors:
      - add_kubernetes_metadata:
          in_cluster: true

      - drop_event:
          when:
            not:
              equals:
                kubernetes.labels.app: "logload"

      # Decode structured logs
      - decode_json_fields:
          fields: ["message"]
          target: "" # move to root of document

      - rename:
          ignore_missing: true
          fail_on_error: false
          fields:
            - from: "level"
              to: "severity"

Solved: the add_kubernetes_metadata processors silently not tagging lines from files it doesn't have metadata about yet, so our load test was dropping them on the floor.

I thought it would at least write an error or emit a a debug line but I haven't seen anything.

1 Like

FWIW: https://github.com/elastic/beats/issues/6042