Hi!
So we are using the following chain:
Filebeat, running in a K8s cluster (one Filebeat instance on each K8s worker node) ->
2 Logstash nodes behind an AWS NLB ->
Elasticsearch cluster
Everything works pretty well, but when the log volume is high, files in Filebeat's queue.disk folder quite often start piling up until we reach queue.disk.max_size.
We don't see any CPU iowait on the K8s nodes, and plenty of CPU resources are available. There are no CPU/memory limits on the Filebeat pods.
At the same time, the persisted queue on both Logstash nodes stays relatively empty, and the Logstash nodes are not loaded (CPU is mostly half idle).
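In case it helps with narrowing this down: since http.enabled is set to true in filebeat.yml (below), each Beat's own output counters can be sampled via its monitoring endpoint, e.g. (port 5066 is the default; exact field names can vary a bit between Filebeat versions, and <filebeat-pod-ip> is just a placeholder):

curl -s http://<filebeat-pod-ip>:5066/stats | jq '.libbeat.output'

If libbeat.output.events.active is consistently high while libbeat.output.write.bytes grows only slowly, that would point at the output/connection side rather than at the harvesters.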
logstash.yml
path.data: /var/lib/logstash
pipeline.workers: 33
# we played with different batch sizes values here
pipeline.batch.size: 131072
path.config: /etc/logstash/conf.d
queue.type: persisted
queue.max_bytes: 310gb
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 35gb
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
path.logs: /var/log/logstash
log.level: info
http.host: 0.0.0.0
Logstash jvm.options:
-Xms60g
-Xmx60g
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djruby.compile.invokedynamic=true
-Djruby.jit.threshold=0
-Djruby.regexp.interruptible=true
-XX:+HeapDumpOnOutOfMemoryError
-Djava.security.egd=file:/dev/urandom
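For completeness, the Logstash side can be cross-checked with the node stats API (served on port 9600 by default, bound to all interfaces via http.host above); this is only a diagnostic sketch, and field names may differ slightly between Logstash versions:

curl -s http://<logstash-node>:9600/_node/stats/pipelines?pretty

If events.in/events.out per pipeline are low and the beats input's queue_push_duration_in_millis is small, that would be consistent with Logstash being starved for input rather than being slow at processing.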
filebeat.yml
filebeat.inputs:
- type: container
  stream: all
  paths:
    - "/var/log/containers/*.log"
  multiline.type: pattern
  multiline.pattern: '^(\d{4})'
  multiline.negate: true
  multiline.match: after

processors:
  - add_kubernetes_metadata:
      default_indexers.enabled: false
      default_matchers.enabled: false
      indexers:
        - container:
      matchers:
        - logs_path:
            logs_path: '/var/log/containers/'
            resource_type: 'container'
  - drop_event:
      when:
        not:
          has_fields: ['kubernetes.labels.log-format']

output.logstash:
  hosts: ["logstash-nlb:5044"]
  loadbalance: false
  compression_level: 0
  pipelining: 5
  # tried different values from 256 to 8192
  bulk_max_size: 1024
  slow_start: false
  # tried different values from 1 to 6
  workers: 6

queue.disk:
  max_size: 25GB
  path: /usr/share/filebeat/data/queue/
  segment_size: 1MB

http.enabled: true
http.host: 0.0.0.0
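One thing I am not sure about is whether the disk queue itself caps the throughput, since, as far as I understand, every event is written to and read back from the on-disk segments before it is shipped. A variant we may test is the default in-memory queue with a larger buffer; the queue.mem option names below are from the Beats docs, but the values are only guesses, not something we have validated:

queue.mem:
  events: 65536
  flush.min_events: 2048
  flush.timeout: 1s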
I noticed that each Filebeat instance sends its log stream to the Logstash nodes at roughly 5-15 MB/s at most, which in our case does not seem to be enough to keep up with our apps' logs.
Can anyone please help me understand where to look for the bottleneck? I would expect Filebeat to ship its buffer to the Logstash nodes as fast as possible and then let them do all the heavy lifting with queuing, parsing, etc.
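For what it's worth, one variant I have been considering (not tested yet) is pointing Filebeat at the two Logstash nodes directly and letting it load-balance between them instead of funnelling everything through the load balancer, roughly like this (hostnames and values are placeholders, not a validated config):

output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044"]
  loadbalance: true
  worker: 4
  bulk_max_size: 4096
  pipelining: 2
  # trade a bit of Filebeat CPU for smaller payloads on the wire
  compression_level: 3

The idea is simply to give each Beat more parallel (and compressed) connections; whether that actually helps obviously depends on where the real bottleneck is.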
Thanks!