Filebeat/Logstash poor disk.queue read performance: is that the maximum I can get?

Testing performance of Filebeat disk.queue.

Environment: AWS EC2
OS: CentOS 7.x
Machine parameters (FB, LS): CPU 8 vcores, RAM 16GB
Filebeat version: filebeat-8.10.3-1.x86_64
Logstash version: logstash-8.10.3-1.x86_64

1st test:

I took existing queue.disk file segments from a production machine, 24GB in total with a segment size of 1MB (24973 segment files), and tried to consume all of them on a standalone Filebeat node with output to the console:

filebeat.yml:

filebeat.inputs:
- type: filestream
  id: my-filestream-id
  paths:
    - /tmp/*.log
output.console:
  enabled: true
queue.disk:
  max_size: 25GB
  path: /var/lib/filebeat/data/queue
  segment_size: 1MB
http.enabled: true
http.host: 127.0.0.1
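
As a side note, since http.enabled is set, the consumption rate can also be sampled from Filebeat's local HTTP monitoring endpoint instead of being estimated from the disappearing segment files. A minimal sketch, assuming the documented default monitoring port (5066) and the event counters exposed by the /stats endpoint:

http.enabled: true
http.host: 127.0.0.1
# documented default port, made explicit here only for clarity; polling
# http://127.0.0.1:5066/stats returns libbeat pipeline/output event
# counters that can be sampled over time to compute events/sec
http.port: 5066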

Got the following results:

  • Filebeat queue files consumption rate:
    60-64 queue.disk segment files/sec or 60-64 MB/sec

  • Filebeat CPU consumption (max 800%: 8 vcores): ~ 230%

  • Didn't notice any IOwaits

2nd test:

4 Filebeat machines, each consuming its own copy of the same set of queue.disk segment files, with output to a standalone Logstash node. Logstash then writes everything to /dev/null.

Filebeat

filebeat.yml:

filebeat.inputs:
- type: filestream
  id: my-filestream-id
  paths:
    - /tmp/*.log
output.logstash:
  hosts: ["logs-ls4:5044"]
  loadbalance: true
  bulk_max_size: 8192
  workers: 8
queue.disk:
  max_size: 25GB
  path: /var/lib/filebeat/data/queue
  segment_size: 1MB
  # read_ahead: 1024
http.enabled: true
http.host: 127.0.0.1
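
On the commented-out read_ahead line above: the disk queue exposes read_ahead/write_ahead settings that control how many events are staged in memory around the on-disk segments. A minimal tuning sketch with illustrative values (1024 matches the value commented out above; the documented defaults are 512 and 2048 respectively):

queue.disk:
  max_size: 25GB
  path: /var/lib/filebeat/data/queue
  segment_size: 1MB
  # events read from disk into memory ahead of the output requesting them
  # (illustrative value; the documented default is 512)
  read_ahead: 1024
  # events held in memory while waiting to be written to disk; only matters
  # on the write path (illustrative value; the documented default is 2048)
  write_ahead: 2048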

Logstash

logstash.yml:

path.data: /var/lib/logstash
pipeline.workers: 6
pipeline.batch.size: 128 
path.config: /etc/logstash/conf.d
queue.type: persisted
queue.max_bytes: 310gb
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 2048mb
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
path.logs: /var/log/logstash
log.level: info

conf.d/10-io.conf:

input {
  beats {
    port => 5044
    ssl => false
  }
}
input {
  beats {
    port => 5045
    ssl => false
  }
}
output {
  file {
    path => "/dev/null"
  }
}
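
One mismatch worth noting between the two configs (an observation, not something measured in these tests): Filebeat is sending batches of up to bulk_max_size: 8192 events, while Logstash processes them in pipeline.batch.size: 128 chunks. A minimal sketch of bringing the Logstash side closer to the Filebeat batch size, with purely illustrative values:

logstash.yml:

pipeline.workers: 8
# illustrative value, closer to Filebeat's bulk_max_size; larger batches
# use more memory per worker, so the JVM heap may need adjusting
pipeline.batch.size: 2048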

Got the following results:

  • Filebeat queue files consumption rate (each FB node):
    1 Filebeat node active: 16-18 queue.disk segment files/sec or 16-18 MB/sec
    2 Filebeat nodes active: 12 queue.disk segment files/sec or 12 MB/sec
    3 Filebeat nodes active: 7-10 queue.disk segment files/sec or 7-10 MB/sec
    4 Filebeat nodes active: 7-10 queue.disk segment files/sec or 7-10 MB/sec

  • Filebeat CPU consumption (max 800%: 8 vcores)
    1 Filebeat node active: 50-70%
    2 Filebeat nodes active: 30-50%
    3 Filebeat nodes active: 20-35%
    4 Filebeat nodes active: 20-35%

  • Logstash CPU consumption (max 800%: 8 vcores)
    1 Filebeat node active: ~ 130%
    2 Filebeat nodes active: ~ 190%
    3 Filebeat nodes active: ~ 240%
    4 Filebeat nodes active: ~ 280%

  • Didn't notice any IOwaits on either the Filebeat or the Logstash side

I'm very concerned by the results of my tests, no matter how powerful the FB/LS machines are:

  • All Filebeats read disk.queue files extremely slowly: 16-18 MB/sec max for FB->LS, 60-64 MB/sec for standalone FB.

  • Filebeat doesn't use all the resources of the machine (e.g. CPU), even when the output is just the console and there is no network interaction with Logstash.

  • Logstash is also using just a fraction of the CPU on its machine.

  • The more Filebeats are connected to the same Logstash, the worse each Filebeat's performance becomes (so Logstash obviously puts some back-pressure on all the Filebeats that send traffic, but why?).

Can you please help me understand whether the results I got are normal, and if not, where the bottleneck might be?

Thanks!

What is the disk type you are using in the VMs in this test?

You are using disk queues on both Filebeat and Logstash; even if you are not seeing any IOwait, this can still keep things from being as fast as you want.

Basically, you seem to have something like this for Filebeat:

log file -> FB Input -> write to disk queue -> read from disk queue -> FB Output 

And like this for Logstash:

FB Output -> LS Input -> write to disk queue -> read from disk queue -> LS Output

Can you provide more context about this? It is a little confusing.

The disk queue is where Filebeat stores the events it reads from its inputs before processing them and sending them to the output, so each Filebeat instance has its own disk queue. It is not clear what you mean by 4 Filebeat machines consuming the same disk queue.
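
If the goal is to isolate how much of the slowdown comes from the second disk hop in the Logstash flow above, a minimal sketch (not something tested in this thread) is to temporarily switch Logstash to its in-memory queue and re-run the Filebeat-to-Logstash test:

logstash.yml:

# temporary isolation test only: with the memory queue, in-flight events
# are lost if Logstash stops, so this is not a production setting
queue.type: memory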

What is the disk type you are using in the VMs in this test?

They are gp3 EBS (AWS) volumes, so they are quite fast.

Basically, you seem to have something like this for Filebeat:

Correct, but for this test I'm not reading real log files. I copied the queue segments from a production node to each of the 4 test Filebeat nodes and am trying to consume/send each copy to Logstash with each Filebeat.

Can you provide more context about this? It is a little confusing.
The disk queue is where Filebeat stores the events it reads from its inputs before processing them and sending them to the output, so each Filebeat instance has its own disk queue. It is not clear what you mean by 4 Filebeat machines consuming the same disk queue.

Sorry, I should have been clearer about that.

In this particular test I'm not reading any input; I just have queue segment files that are the same on all 4 Filebeat nodes. Basically, I copied the whole folder with queue segments to each of the 4 Filebeat nodes.

The goal of the test is to see how quickly Filebeat can read its queue from disk and then how quickly it can send it to Logstash. So I ran 2 tests, without Logstash and with Logstash, and posted the results in the original message.

As the result of the 1st test, I see that Filebeat can consume the queue files and output them to the console at approximately 60-64 MB/sec, which is pretty low, but still not as bad as in the second test.

The second test shows a reduction in speed to 16-18 MB/sec when the output is no longer the console but Logstash, and the more Filebeats I add, the lower the speed becomes. So it looks like Logstash is the bottleneck here.
