Filebeat queue.disk keeps piling up even when Logstash persisted queue remains relatively empty

Hi!

So we are using the following chain:

Filebeat instances running in a K8s cluster (one Filebeat per k8s worker node) ->
2 Logstash nodes behind an AWS ALB ->
Elasticsearch cluster

Everything works pretty well, but when log volume is high, files in the Filebeat queue.disk folder quite often start piling up until we hit queue.disk.max_size.

We don't see any CPU iowait on the K8s nodes, and there is plenty of CPU available. There are no CPU/memory limits on the Filebeat pods.
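For reference, "no limits" means the Filebeat DaemonSet pod spec only sets resource requests; the relevant fragment looks roughly like this (a minimal sketch, the request values are illustrative rather than our exact ones):

  containers:
    - name: filebeat
      resources:
        requests:
          cpu: 100m        # request only, so the pod can burst to whatever CPU the node has free
          memory: 200Mi
        # no "limits" block, i.e. no CPU throttling and no memory cap from Kubernetes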

At the same time, the persisted queue on both Logstash nodes remains relatively empty, and the Logstash nodes are not under load (CPU is mostly half idle).

logstash.yml

path.data: /var/lib/logstash
pipeline.workers: 33
# we played with different batch size values here
pipeline.batch.size: 131072
path.config: /etc/logstash/conf.d
queue.type: persisted
queue.max_bytes: 310gb
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 35gb
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
path.logs: /var/log/logstash
log.level: info
http.host: 0.0.0.0
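For context on those numbers: with 33 workers and a batch size of 131072, up to 33 × 131072 ≈ 4.3 million events can be in flight in Logstash at once; assuming a rough average of ~1 KB per event (an estimate, not a measured figure), that is on the order of 4 GB of events held in the heap at a time.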

Logstash jvm.options:

-Xms60g
-Xmx60g
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djruby.compile.invokedynamic=true
-Djruby.jit.threshold=0
-Djruby.regexp.interruptible=true
-XX:+HeapDumpOnOutOfMemoryError
-Djava.security.egd=file:/dev/urandom

filebeat.yml

filebeat.inputs:
- type: container
  stream: all
  paths:
    - "/var/log/containers/*.log"
  multiline.type: pattern
  multiline.pattern: '^(\d{4})'
  multiline.negate: true
  multiline.match: after

processors:
- add_kubernetes_metadata:
    default_indexers.enabled: false
    default_matchers.enabled: false
    indexers:
      - container:
    matchers:
      - logs_path:
          logs_path: '/var/log/containers/'
          resource_type: 'container'

- drop_event:
    when:
      not:
        has_fields: ['kubernetes.labels.log-format']

output.logstash:
  hosts: ["logstash-nlb:5044"]
  loadbalance: false
  compression_level: 0
  pipelining: 5
  # tried different values from 256 to 8192
  bulk_max_size: 1024
  slow_start: false
  # tried different values from 1 to 6
  workers: 6

queue.disk:
  max_size: 25GB
  path: /usr/share/filebeat/data/queue/
  segment_size: 1MB

http.enabled: true
http.host: 0.0.0.0
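A few notes on that config, in case they matter:

The multiline settings mean that any line which does not start with four digits (i.e. the year of a timestamp) is appended to the preceding line that does, so stack traces and other continuation lines get folded into the event that starts with their timestamp.

The drop_event processor keeps only events from pods that carry a log-format label; a pod whose logs we want to keep is labelled roughly like this (illustrative only, has_fields checks the field's presence, not its value):

  metadata:
    labels:
      log-format: json

With a 1MB segment_size and a 25GB max_size, the disk queue directory can end up holding on the order of 25,000 segment files when it is full. And since the HTTP endpoint is enabled, the libbeat output and queue counters are available on Filebeat's stats endpoint (port 5066 by default).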

I noticed that each Filebeat instance sends its log stream to the Logstash nodes at around 5-15 MB/sec at most, which in our case does not look to be enough to keep up with our apps' logs.
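To put that in event terms: assuming an average event size of around 1 KB (a rough guess for our log lines), 5-15 MB/sec is only about 5,000-15,000 events/sec per Filebeat, which is apparently less than what the pods on a busy node produce during peaks.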

Can anyone please help me understand where to look for the bottleneck? I would expect the Filebeats to ship their buffers to the Logstash nodes as fast as possible and let Logstash do all the heavy lifting with queuing, parsing, etc.

Thanks!
