Filebeat CPU usage


(Matthias Wilhelm) #1

We're using Filebeat to forward logs to a Redis server; from there they're processed by Logstash and indexed by Elasticsearch.

One of our customers complains that Filebeat needs more resources than the process that writes the logs.

In his opinion something is wrong with it, and he blames the regular expression. (His old log forwarder needs only about 1% CPU, but it just forwards plain-text messages via UDP, which is a totally different scenario; the goal, however, is to minimise the performance impact of the new log pipeline.) So we ran a few tests to check it.

We took a log file written under heavy traffic (501 MByte, 1,944,137 events in 371 s, ~5,240 events/second) and forwarded it with Filebeat to our Redis queue. We limited Filebeat to one CPU via max_procs.

Filebeat (6.4, running under CentOS on a Xeon E5-2640) took about 93 s (~21k events/sec) to forward the logs to Redis, using 100% of one CPU. From that we estimated it would use about 29% of one CPU if it submitted the logs while they were being written.

We then ran several tests, e.g.:

  • without the multiline regexp it took 117 s (38k/sec, but since it had to submit many more messages, this is not comparable)
  • using the plain codec: 74 s (26k/sec)
  • using the plain codec, writing to console: 63 s (31k/sec)

We think the CPU usage of Filebeat is normal, or did we overlook a magic switch to lower it significantly?

Here's our filebeat.yml for the default test:

- input_type: log
  paths:
  - "${path.home}/data/ourservice/*.log"
  fields:
    type: ourtype
    service: ourservice
    hostname: hostxy
    ipV4: 127.0.0.1

  multiline.pattern: '^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[T][[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
  multiline.negate: true
  multiline.match: after
  tail_files: false
  close_eof: true

#------------------------------- Redis output ----------------------------------
output.redis:
  # Boolean flag to enable or disable the output module.
  enabled: true

  hosts: ["127.0.0.2:6379"]

  # default logs.
  key: logs

  #the priority lane for correlation records
  keys:
  - key: correlation
    when.contains:
      message: "H_trnsp"

Thank you so much for your feedback. I could also offer a CPU profile for the default test, if that would be useful.


(Steffen Siering) #2

Which Filebeat version?

Without a CPU profile I can hardly put my finger on any hotspots. Just some things that come to mind:

The multiline regex might not need to be a regex at all. This PR introduced an "optimizer" that tries to replace regular expressions with very simple string or date matchers, so as to bypass the much more expensive regex engine. I'm not sure whether the [T] in your regex disables the optimization, though. A pattern like ^\d{4}-\d{2}-\d{2} should fit your case, and it will be optimized to a simple hand-coded prefix check.
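For reference, the multiline section could be rewritten like this (a sketch based on the config posted above; whether the optimizer still kicks in for the longer time-of-day suffix is something to verify with a profile):

```yaml
# Plain \d character classes instead of POSIX [[:digit:]] classes and
# the bracketed [T], so the pattern optimizer has a chance to replace
# the regex with a hand-coded prefix/date check.
multiline.pattern: '^\d{4}-\d{2}-\d{2}'
multiline.negate: true
multiline.match: after
```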
These benchmark results match this pattern:

|Benchmark                                                                            |Iterations|ns/op       |
|-------------------------------------------------------------------------------------|----------|------------|
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=mixed-4               | 1000000  | 2021 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=mixed-4               |10000000  |  117 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log-4          | 1000000  | 1879 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log-4          |20000000  |  110 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log2-4         |  500000  | 2732 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log2-4         | 5000000  |  277 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log_with_level-4| 1000000 | 1487 ns/op |
|BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log_with_level-4|20000000 | 99.5 ns/op |

Filebeat copies the content into temporary buffers (to deal with potential file truncation by rotation strategies), plus multiline has to combine existing buffered lines. This does not come for free and produces some garbage that needs to be collected.

Filebeat does not just send the plain contents as-is; it generates structured events with read timestamps and adds additional metadata per event. The final event is then JSON-encoded when publishing to Redis.
From experience, most CPU is consumed by the JSON encoding plus garbage collection of published events.

The output.redis.keys setting is evaluated per event. It searches for the string in message either linearly (using CPU instructions) or with Rabin-Karp (worst-case time complexity O((n-m)*m)), depending on the length of the search pattern.

As you noticed, CPU usage very much depends on event throughput. Have you considered getting more realistic numbers by rate-limiting the number of events received from Filebeat? I do this with my own fake Logstash server, but I don't know of any good project doing the same for Redis. Using libs like go-redis-server or redcon, this should be doable.

Internally, Filebeat operates fully asynchronously. That is, even with one processor, while it's waiting for the response from Redis it collects and processes new events. One can also try to limit the queue sizes and bulk_max_size so as to reduce the amount of concurrent work. Once the queue is full, it blocks waiting for the ACK from Redis (assuming bulk_max_size and queue.mem.events have the same value). The total amount of work is still the same, though.
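A sketch of such a configuration (the sizes are illustrative, not recommendations):

```yaml
# With queue.mem.events equal to bulk_max_size, filebeat fills one
# batch, sends it, and blocks until redis ACKs before buffering more,
# which limits the amount of concurrent work in flight.
queue.mem:
  events: 2048
  flush.min_events: 2048

output.redis:
  hosts: ["127.0.0.2:6379"]
  key: logs
  bulk_max_size: 2048
```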


(Matthias Wilhelm) #3

Hi, thank you very much for your feedback.

Here's the CPU profile of my test yesterday:

https://www.icloud.com/iclouddrive/0MqVPUqErD_uKiiA1qaqTHz2Q#cpuprofile

I ran several tests:

  • Filebeat 5.6.1 → Redis, 100% CPU usage. Time: 180 s
  • Filebeat 6.4.0 → Redis, 100% CPU usage. Time: 90.2 s
  • Filebeat 6.4.0 → Redis, 100% CPU, adapted date regexp. Time: 89.8 s
  • Filebeat 6.4.0 → Redis, 100% CPU, removed Redis regexp. Time: 84.2 s
  • Filebeat 6.4.0 → Redis, 100% CPU, adapted regexp, plain codec (no JSON). Time: 63.2 s
  • Filebeat 6.4.0 → Logstash, 78% CPU, adapted regexp. Time: 96.2 s
  • Filebeat 6.4.0 → Redis, 100% CPU, just sending signaling messages (5% of the log messages). Time: 46.7 s

I've added some metrics in a document:
https://www.icloud.com/iclouddrive/0l6nec6eRz8UrJXAuDQMSlU-A#Ohne_Titel

My summary:

  • The improvement from Filebeat 5.6 to 6.4 is impressive :+1:
  • Adapting the regexp brought small performance gains
  • Using the plain codec instead of the default JSON led to much better performance

So I think we've done what we could performance-wise (in the current architecture, using just Filebeat to submit logs), haven't we?


(Matthias Wilhelm) #4

We've now rolled out Filebeat 6.4.3 to our customer's systems, so I wanted to share the results.

It matches the results from our lab (hurray!). Compared to Filebeat 5.6, 6.4.3 ships the logs in 50% of the time, so performance doubled, and average CPU usage dropped to about 30%. Kudos to Filebeat engineering!


(Matthias Wilhelm) #5

In another iteration we're thinking of using the plain codec for shipping logs; our PoC test case showed a performance gain of 25% for this scenario. Maybe there is some potential for improvement in the JSON encoder Filebeat uses? In any case, shipping with the plain codec clearly performs better.
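For anyone who wants to reproduce the plain-codec setup, the output codec can be switched with codec.format (a sketch; the field reference below is an assumption based on our events):

```yaml
output.redis:
  hosts: ["127.0.0.2:6379"]
  key: logs
  # Plain string codec: no JSON encoding or escaping; only the
  # referenced field(s) are written to the output.
  codec.format:
    string: '%{[message]}'
```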


(Steffen Siering) #6

The CPU profile is quite noisy. I could only open the complete file; when trying to filter out symbols, I couldn't load the file :frowning:.

Anyway, it is very clear from the profile that most CPU is actually spent on JSON encoding. A lot of the work is done on strings: when encoding to JSON, the encoder has to check all strings for symbols that need to be escaped. The richer an event is (more fields), the more work the encoder has to do, and having many strings also increases the amount of work. Using the plain string formatter can indeed improve performance, because it does no escaping and you select only a subset of the fields available in the event (JSON encoding is actually quite an expensive operation).

If you want to continue with the JSON encoder, check whether you can use the drop_fields processor to remove event fields you don't need.
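A minimal sketch of this, assuming some typical Filebeat metadata fields you might not need (the field names are examples, not a recommendation):

```yaml
processors:
  - drop_fields:
      # Every field dropped here is one less string the JSON encoder
      # has to escape and serialize per event.
      fields: ["beat.version", "input.type", "offset"]
```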

I also noticed the harvester registry being somewhat busy with state updates. Could it be that you have many files in your directory? But maybe it's just due to the increased throughput, as updates happen per line.

So I think we've done what we could performance-wise (in the current architecture, using just Filebeat to submit logs), haven't we?

There is one more knob you can try to tune; sometimes it works, sometimes not. The registry file itself can create additional back-pressure. Normally the file is written asynchronously, but it requires an fsync to be written reliably. At times the queues can fill up because IO + fsync takes too long, such that the registry file creates additional back-pressure. You can reduce registry writes (at the risk of duplicates) by setting filebeat.registry_flush: 1s (or higher). If Filebeat is stopped cleanly, it always updates the registry file one last time. Only if Filebeat or the machine crashes hard will updates be missing -> duplicates will be sent.
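In filebeat.yml that would look like this (the interval is just an example value to experiment with):

```yaml
# Batch registry writes: flush the registry file at most once per
# second instead of after every ACKed batch. Fewer fsyncs, at the
# cost of possible duplicate events after a hard crash.
filebeat.registry_flush: 1s
```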

I think we still have some room for improvement in the code base (no low-hanging fruit, though), but not much on the JSON encoding. We already use a custom JSON encoder that is more CPU- and memory-efficient than the default one. The best alternative to the plain/JSON codecs would be some binary encoding (the lib we use supports CBOR and UBJSON; e.g. we use CBOR for spooling to disk).


(Matthias Wilhelm) #7

Hi

Since it's a test case, the registry file is created at the start of the test, so I doubt there's a problem there. I currently can't test registry_flush, but I'll keep it in the backlog of my brain :slight_smile:
You mention a binary encoding; is this already possible?

So far our customer seems to be satisfied with the improvements; he's testing it for one week on one of his systems.

thanks,
Matthias


(Steffen Siering) #8

No, binary encodings are not exposed yet. One annoyance with binary encodings is the varying quality of support across different libs/languages.


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.