Filebeat only operating at 10%-15% of expected performance

Hi,

I have spent quite some time over the past few days profiling configurations of Filebeat (7.6.2), since it is currently running very slowly. My test dataset consists of ~2 GB of ndjson files, at 10 MB per file, with ~6 million JSON objects/log lines in total. Each JSON object contains about 8-20 key-value pairs.

After trying out every configuration change I could find, I still only reach about 10k docs/s. This means I am waiting 10 minutes on my 2 GB test dataset. Production datasets will be somewhere around 50-200 GB. For these sizes, 10k docs/s is basically unusable.
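To put numbers on that, here is a quick back-of-the-envelope calculation. It only uses the figures from this post and assumes that the number of docs per GB stays the same for the production data, which may not hold exactly.

```python
# Figures taken from the post; the docs-per-GB ratio is assumed to stay
# constant when scaling from the test dataset to production sizes.
TEST_DOCS = 6_000_000        # ~6 million log lines in the test dataset
TEST_SIZE_GB = 2             # ~2 GB of ndjson
RATE_DOCS_PER_S = 10_000     # indexing rate observed with Filebeat


def ingest_seconds(size_gb, rate=RATE_DOCS_PER_S):
    """Estimated ingest time in seconds for a dataset of size_gb gigabytes."""
    docs = TEST_DOCS / TEST_SIZE_GB * size_gb
    return docs / rate


for size_gb in (2, 50, 200):
    seconds = ingest_seconds(size_gb)
    print(f"{size_gb:>3} GB: {seconds / 60:7.1f} min ({seconds / 3600:5.2f} h)")
```

Under that assumption, 50 GB would take roughly 4 hours and 200 GB roughly 17 hours at 10k docs/s.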

Now, the cause of this slow indexing rate could of course be many things, like network, IO, CPU, RAM, Elasticsearch settings and so on. So I started to eliminate each of these causes:

I run Elasticsearch (7.6.2) on the same local machine as Filebeat to eliminate the network (I was seeing 8-16kbit/s before). I could also see that during indexing, the system (Windows 10, 12c/24t Intel Xeon, 32 GB RAM, NVMe SSD) was basically sitting idle (<10% CPU, 16 GB free RAM, <10% SSD active time/data rate), so CPU, RAM and IO are also out of the question.

To eliminate Elasticsearch as the culprit, I wrote a naive Python script that just reads the files and sends them to Elasticsearch via the bulk API. After some fiddling around, I managed to achieve an index rate of 55k docs/s single-threaded, and 83k docs/s multi-threaded (10 threads, although 3 threads already hit 80k). So Elasticsearch is also not the bottleneck.
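The script itself is not part of this post. For anyone who wants to reproduce the comparison, a naive bulk indexer along these lines could look roughly like the sketch below. It is not the actual script: the index name, the batch size, the thread count and all function names are assumptions made for illustration, and it uses only the standard library against the `_bulk` endpoint.

```python
import glob
import itertools
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ES_URL = "http://localhost:9200"   # local Elasticsearch, as in the post
INDEX = "bulk-test"                # hypothetical index name
BULK_SIZE = 3000                   # docs per bulk request (assumption)
THREADS = 3                        # worker threads (assumption)


def build_bulk_body(lines, index):
    """Turn raw ndjson lines into an Elasticsearch _bulk request body."""
    action = json.dumps({"index": {"_index": index}})
    parts = []
    for line in lines:
        parts.append(action)        # action line
        parts.append(line.strip())  # source document, already JSON
    # The bulk body must end with a newline.
    return ("\n".join(parts) + "\n").encode("utf-8")


def send_bulk(body, es_url=ES_URL):
    """POST one bulk body and return the parsed JSON response."""
    request = urllib.request.Request(
        es_url + "/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def index_file(path, index=INDEX, bulk_size=BULK_SIZE):
    """Send one ndjson file in batches; returns the number of docs sent."""
    sent = 0
    with open(path, encoding="utf-8") as handle:
        docs = (line for line in handle if line.strip())  # skip blank lines
        while True:
            batch = list(itertools.islice(docs, bulk_size))
            if not batch:
                break
            result = send_bulk(build_bulk_body(batch, index))
            if result.get("errors"):
                raise RuntimeError(f"bulk request reported errors for {path}")
            sent += len(batch)
    return sent


def index_files(pattern, threads=THREADS):
    """Index all files matching pattern, one file per task in a thread pool."""
    paths = glob.glob(pattern)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(index_file, paths))
```

Calling `index_files` with a glob pattern for the log directory and timing the call gives a docs/s figure that can be compared to Filebeat on the same data.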

This only leaves Filebeat as the bottleneck. It only achieves 11% of the performance of a script that was hacked together in 3 hours. I hope there is something I can change in the configuration to get Filebeat to at least similar speeds. This is my current configuration:

filebeat.inputs:
- type: log
  enabled: true
  json.keys_under_root: true
  json.add_error_key: true
  paths:
    - O:\elastic_logs\log_20210122_oracle\log\*.lf2
 
setup.template.name: "gb30-lf2"
setup.template.pattern: "gb30-lf2*"
 
setup.ilm.enabled: false
name: oracle
 
queue.mem.events: 600000
 
output.elasticsearch:
  hosts: ["http://localhost:9200"]
  workers: 10
  pipelining: 4
  bulk_max_size: 3000
  compression_level: 0
  index: "gb30-aws-oracle-2021.01.22"

I have, in 25 runs, tried various combinations of the following settings:

workers: not set, 1, 3, 7, 10, 15, 20, 30
bulk_max_size: not set, 10, 200, 3000, 4000, 5000, 6000, 7000, 20000
pipelining: not set, 4, 100
loadbalance: not set, true, false
queue.mem.events: not set, 2000, 60000, 600000
compression_level: not set, 0, 1, 2, 3, 5, 9, 10

Besides bulk_max_size and workers, none of these settings has shown any impact on performance. Raising the number of workers above 3 seems to be irrelevant to the bottleneck, and my sweet spot for bulk_max_size is in the 3000-4000 range. Lower and higher values performed worse. What else can I try?

I have also tried just running 6 Filebeat instances on the same data at the same time. They managed 23k docs/s, which is better... kind of? I hoped that increasing workers would have a similar effect, since after all, the only real difference is the number of threads, but it doesn't.

I should note that my machine was still not really impressed by the 83k: CPU was at 50%, there was RAM to spare, and the SSD, a drive that can sustain 3 GB/s reads and writes, was averaging about 200 MB/s and maybe 40% busy time (that number is the total read/write activity on the drive, not what actually went over the sockets). Maybe Elasticsearch was the bottleneck at that point (although it didn't perform any differently with a 10 GB heap).

Still, judging by those numbers, 160k docs/s should be perfectly doable, provided that Elasticsearch and Filebeat manage to use the resources they are given. I also think 83k can reasonably only be a lower bound on performance, since after all, I am comparing against a crude Python script here. I hope there are some configuration options left that I overlooked and that can help me reach that goal!

It sounds like your Python script doesn't have any queuing going on between reading data from the source files and sending to Elasticsearch. To achieve the same effect in Filebeat, maybe try setting queue.mem.flush.min_events: 0 and queue.mem.flush.timeout: 0s in your Filebeat configuration. Does it make any difference in performance?

This doesn't help, unfortunately. To clarify: the script actually does some queueing: it aggregates the lines into bulks of 3000 and sends them using Elasticsearch's bulk API. There just is no timeout, since it assumes the files do not grow.
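To make the difference concrete, the batching described here can be sketched as follows. This is an illustrative reconstruction, not the original code, and the function name is made up. A batch is flushed only when it reaches the target size or when the input ends, so there is no time-based flush at all.

```python
def size_only_batches(lines, size=3000):
    """Yield lists of up to `size` lines.

    A batch is emitted only when it is full or when the input is
    exhausted; there is deliberately no timeout, because the input
    files are assumed not to grow.
    """
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller batch at end of input.
        yield batch
```

Each yielded batch would then be turned into one bulk request.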

Is this all I can try to solve this issue? To be honest, I thought Filebeat underperforming in a major way would spark some more engagement, but it doesn't seem to, and I am not sure why.

Is this performance expected? Is there simply no way to make it go faster? I took Filebeat to be a production-ready tool, but with the performance I am currently seeing, it doesn't act like one.

Are there more mature alternatives I can try?