Filebeat stops shipping events under load

I have observed that when Filebeat ships container logs that grow rapidly, it reaches some threshold and then simply stops shipping events from that log file. This behaviour was initially suspected to be a side effect of log rotation, but it persists even with log rotation disabled.

My testing setup consists of a "log driver" (a small utility that just generates log lines) and Filebeat configured to autodiscover container logs and ship them to Elasticsearch. This runs on Azure Kubernetes Service, so all workloads (Filebeat, Elasticsearch, and the log driver) are containerized. The log driver's generation rate is deliberately tuned beyond what Filebeat can sustain, so that Filebeat falls behind.

Here I'll share the setup and outcome of a single test session:

Filebeat version: 7.9.1
Elasticsearch version: 7.5.2
Log driver log generation rate: 2M events at 10k events per second
Max container log file size: 500MB

# List container log file (check inode + size)
root@aks-logdriver1-38647758-vmss000000:~# ls -ail /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/*
4391497 -rw-r----- 1 root root 396089742 Sep 28 20:14 /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log

# Confirm 2M events were generated and written to log file
root@aks-logdriver1-38647758-vmss000000:~# wc -l /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log
2000000 /var/log/pods/logdriver_logdriver-core-g2wkf_99fe2aa2-e104-44c7-bf80-9dbcc0b38d68/java-gold/0.log

Events received by Elasticsearch: 1,302,768


(No further events would be received by ES.)
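That works out to roughly 65% of the generated events; a quick way to compute the fraction (numbers taken from the run above):

```shell
# Fraction of generated events that reached Elasticsearch in this run.
awk 'BEGIN { printf "%.1f%%\n", 100 * 1302768 / 2000000 }'
# prints 65.1%
```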

Checking Filebeat's registry, there are 665 entries for the log driver's container log file (too long to post in full, but here's a snippet):

{"k":"filebeat::logs::native::4391497-2049","v":{"id":"native::4391497-2049","prev_id":"","offset":258007620,"FileStateOS":{"inode":4391497,"device":2049},"source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","timestamp":[792492084,1632860154],"ttl":-1,"type":"container","identifier_name":"native"}}
{"k":"filebeat::logs::native::4391497-2049","v":{"FileStateOS":{"device":2049,"inode":4391497},"id":"native::4391497-2049","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","ttl":-1,"offset":258007620,"timestamp":[792492084,1632860154],"type":"container","identifier_name":"native"}}
{"k":"filebeat::logs::native::4391497-2049","v":{"id":"native::4391497-2049","timestamp":[792492084,1632860154],"type":"container","FileStateOS":{"inode":4391497,"device":2049},"identifier_name":"native","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","offset":258007620,"ttl":-1}}
{"k":"filebeat::logs::native::4391497-2049","v":{"timestamp":[792492084,1632860154],"ttl":-1,"FileStateOS":{"inode":4391497,"device":2049},"identifier_name":"native","id":"native::4391497-2049","offset":258007620,"type":"container","prev_id":"","source":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log"}}

It's not clear to me why there are so many duplicate registry entries for the same file, with many duplicates per offset. The largest offset is 258007620, far short of the 396089742 bytes in the container log file. Notably, 258007620 / 396089742 ≈ 65.1%, the same fraction as the 1,302,768 of 2,000,000 events that reached Elasticsearch, so the registry offset and the ES count stalled at the same point in the file.
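For anyone wanting to reproduce the tally, I grouped the registry entries by offset with a pipeline like the one below. It's shown here against a small inline sample so it's self-contained; in practice you'd feed it Filebeat's registry log instead (on 7.x that lives under Filebeat's data directory, e.g. data/registry/filebeat/log.json, though the exact path depends on your deployment):

```shell
# Count duplicate registry entries per offset. Replace the inline sample
# with the real registry log, e.g. data/registry/filebeat/log.json
# (path is an assumption; it varies by install).
printf '%s\n' \
  '{"k":"filebeat::logs::native::4391497-2049","v":{"offset":258007620}}' \
  '{"k":"filebeat::logs::native::4391497-2049","v":{"offset":258007620}}' \
  '{"k":"filebeat::logs::native::4391497-2049","v":{"offset":123456}}' |
  grep -o '"offset":[0-9]*' | sort | uniq -c | sort -rn
```

The count column makes it easy to see which offsets were written to the registry repeatedly.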

Filebeat logs:

2021-09-28T20:11:31.829Z	INFO	log/input.go:157	Configured paths: [/var/log/containers/*-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log]
2021-09-28T20:11:31.829Z	INFO	log/input.go:157	Configured paths: [/var/log/containers/*-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log]
2021-09-28T20:11:31.831Z	INFO	log/harvester.go:297	Harvester started for file: /var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log
2021-09-28T20:11:36.497Z	INFO	[monitoring]	log/log.go:145	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":758550,"time":{"ms":79}},"total":{"ticks":2081300,"time":{"ms":1698},"value":2081300},"user":{"ticks":1322750,"time":{"ms":1619}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":17},"info":{"ephemeral_id":"c4768a0b-30ac-41c4-8194-7fdd83803d1b","uptime":{"ms":1214490095}},"memstats":{"gc_next":84897472,"memory_alloc":76794560,"memory_total":176682582872,"rss":38346752},"runtime":{"goroutines":159}},"filebeat":{"events":{"active":3946,"added":22888,"done":18942},"harvester":{"files":{"382da571-8b04-4ebd-a2ff-544f7c71feb6":{"last_event_published_time":"2021-09-28T20:11:33.768Z","last_event_timestamp":"2021-09-28T20:11:31.881Z","read_offset":6346,"size":30289},"c93dd7b4-6d01-4988-bf13-f1bba036321b":{"last_event_published_time":"2021-09-28T20:11:33.742Z","last_event_timestamp":"2021-09-28T20:11:31.834Z","read_offset":2279,"size":1397},"ebd9e815-c1a3-453b-9ae1-7fb4b8dc7fef":{"last_event_published_time":"2021-09-28T20:11:36.097Z","last_event_timestamp":"2021-09-28T20:11:33.919Z","name":"/var/log/containers/logdriver-core-g2wkf_logdriver_java-gold-4445ee10dc3179c85c6a47c52b0c5cb962ad0b66d4c6ecca069de630d37d4e74.log","read_offset":4522961,"size":295814,"start_time":"2021-09-28T20:11:31.830Z"}},"open_files":3,"running":3,"started":1}},"libbeat":{"config":{"module":{"running":8,"starts":1}},"output":{"events":{"acked":18941,"active":4096,"batches":15,"total":23037}},"pipeline":{"clients":8,"events":{"active":4117,"filtered":1,"published":22886,"total":22888},"queue":{"acked":18941}}},"registrar":{"states":{"current":11,"update":18942},"writes":{"success":16,"total":16}},"system":{"load":{"1":0.16,"15":0.01,"5":0.03,"norm":{"1":0.08,"15":0.005,"5":0.015}}}}}}

The harvester is started, but it is neither marked inactive nor stopped until the log driver container is removed. The log driver's log file is not mentioned again anywhere in Filebeat's logs.

Filebeat configuration (filebeat.yml):

filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.labels.core-filebeat-logging: "true"
          config:
            - type: container
              multiline:
                pattern: '^\['
                negate:  true
                match:   after
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
        - condition:
            not:
              equals:
                kubernetes.labels.core-filebeat-logging: "true"
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log

processors:
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

setup.template.settings:
  setup.ilm.enabled: false
  setup.template.enabled: false
setup.kibana:
  host: "kibana:443"
  protocol: "https"
output.elasticsearch:
  hosts: ["elastic:443"]
  protocol: "https"
  pipeline: all_logs
  username: elastic
  password: "${ES_PASS}"
  bulk_max_size: 3200
  worker: 2

  ttl: 100
  pipelining: 0

template:
  name: "filebeat"
  path: "fields.yml"
  overwrite: true
monitoring:
  enabled: true
  elasticsearch:
    hosts: ["elastic:443"]
    username: elastic
    password: "${ES_PASS}"
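As an aside on the multiline settings in the first autodiscover template above: with pattern '^\[', negate: true, and match: after, any line that does not start with [ is folded into the preceding event. A toy awk approximation of that joining semantics (not Filebeat's actual implementation, just an illustration):

```shell
# Lines NOT matching ^\[ are appended to the previous event, joined with a
# literal "\n", mimicking multiline pattern '^\[', negate: true, match: after.
printf '[ERROR] boom\n\tat Foo.main(Foo.java:1)\n[INFO] ok\n' |
  awk '/^\[/ { if (buf != "") print buf; buf = $0; next }
             { buf = buf "\\n" $0 }
       END   { if (buf != "") print buf }'
```

This emits two events: the error line with its stack-trace continuation attached, and the info line on its own.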

Filebeat does not report any dropped events.

We are also monitoring Elasticsearch bulk rejections as a signal of load on ES, and over the duration of this test ES shows no bulk rejections.

Any insight that could be shared into this problem would be greatly appreciated.

Are you able to upgrade?
Elasticsearch 7.5 is EOL, and Filebeat 7.9 is a little old.

Upgrading the stack is in the plans, but it won't happen for a while. I'm aware our stack is dated.