Tuning Filebeat performance: why can't I drive CPU usage close to 100%?

Hello,
I tried to tune filebeat performance. As @steffens suggested in another post, I changed the output to console. One thing I find strange: I allocated 3 cores to filebeat, yet I can only drive the CPU usage to two cores at 50% and one core at 20%. We are using filebeat 6.1.3.
Why can't I drive CPU usage close to 100%? What might be the bottleneck? Any tips on debugging it would be appreciated.
The log stats are:
2019-07-10T18:11:08-07:00 INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30038 beat.memstats.gc_next=59367392 beat.memstats.memory_alloc=54714200 beat.memstats.memory_total=12811647016 filebeat.events.active=1247 filebeat.events.added=180043 filebeat.events.done=178796 filebeat.harvester.open_files=3 filebeat.harvester.running=3 filebeat.harvester.started=3 libbeat.config.module.running=0 libbeat.output.events.acked=178790 libbeat.output.events.batches=88 libbeat.output.events.total=178790 libbeat.output.type=console libbeat.output.write.bytes=74209763 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=1247 libbeat.pipeline.events.filtered=6 libbeat.pipeline.events.published=180037 libbeat.pipeline.events.total=180043 libbeat.pipeline.queue.acked=178790 registrar.states.current=6 registrar.states.update=178796 registrar.writes=94
2019-07-10T18:11:38-07:00 INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=29985 beat.memstats.gc_next=43650096 beat.memstats.memory_alloc=45868008 beat.memstats.memory_total=14817652576 filebeat.events.active=-1247 filebeat.events.added=174213 filebeat.events.done=175460 filebeat.harvester.open_files=3 filebeat.harvester.running=3 libbeat.config.module.running=0 libbeat.output.events.acked=175460 libbeat.output.events.batches=86 libbeat.output.events.total=175460 libbeat.output.write.bytes=73454908 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=0 libbeat.pipeline.events.published=174213 libbeat.pipeline.events.total=174213 libbeat.pipeline.queue.acked=175460 registrar.states.current=6 registrar.states.update=175460 registrar.writes=86

Events added is roughly the same as events done, so why can't filebeat add events faster?
My filebeat YAML file:

   #=========================== Filebeat prospectors =============================                                                                                                      
    filebeat.prospectors:

    # Each - is a prospector. Most options can be set at the prospector level, so
    # you can use different prospectors for various configurations.
    # Below are the prospector specific configurations.

    - input_type: log 

      # Paths that should be crawled and fetched. Glob based paths.
      paths:
        - /export/content/lid/apps/watchmen-server/i001/logs/*

      #  closes the harvester/fd when a file is removed.
      close_removed: true

      # filebeat will focus on sending the current log files
      ignore_older: 60m 
      close_inactive: 5m

      # set max lifetime of each harvester (fd)
      close_timeout: 70m 
      #after clean_inactive, the file content will be sent again from the beginning if the same file name re-appears
      #for sparse logs, we do not mind dups in kafka; kafka will discard old logs > 5 days
      clean_inactive: 48h 
      clean_removed: true
      symlinks: true

      ### Multiline options

      # Multiline can be used for log messages spanning multiple lines. This is common
      # for Java stack traces or C line continuations.

      # The regexp pattern that has to be matched.
      multiline.pattern: '^[[:digit:]]{4,}[[:punct:]][[:digit:]]{2,}|^[I|W|E|F|D|T][[:digit:]]{4,}'

      # Defines if the pattern set under pattern should be negated or not. Default is false.
      multiline.negate: true

      # Match can be set to "after" or "before". It is used to define if lines should be appended to a pattern
      # that was (not) matched before or after, or as long as a pattern is not matched based on negate.
      # Note: "after" is the equivalent to "previous" and "before" is the equivalent to "next" in Logstash
      multiline.match: after
    #================================ processor =====================================

    processors:
     - drop_fields:
         fields: ["beat.name", "beat.version", "input_type", "@timestamp", "@metadata", "metadata.type", "type", "offset"]

    #-------------------------- Console output ------------------------------
    output.console:
      pretty: false

Please format logs and config files using the </> button.

Filebeat can read logs quite fast, but then one might experience back-pressure in filebeat, which in turn slows down the reading. Back-pressure normally occurs in the output and in writing the registry file. With console output you are still subject to back-pressure, especially if you write to another process or a console (even with /dev/null you rely on syscalls and the OS to throw away the contents).
When writing to /dev/null (and in scenarios with many, many files), the source of back-pressure is normally the registry file writing.

As of now the registry is an in-memory store tracking states. By default the store gets dumped after every ACK from the outputs, which means write + fsync operations. Disk operations can produce back-pressure here. Depending on the disk, configuring the registry flush timeout to 5s, to reduce the number of IO+sync operations, often helps.
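For illustration, a minimal sketch of that setting (the exact key depends on the filebeat version; the 6.x reference file documents it as filebeat.registry_flush, while this thread also writes it as registry.flush):

    # flush registry state at most every 5s instead of after every ACK
    filebeat.registry_flush: 5s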


Btw, the change to the above parameter helped us drive up CPU usage, thanks for the information.
I am curious what "event" in filebeat.events.added refers to.
Once I increased the registry flush timeout to 5s, filebeat.events.added became much smaller.
I originally thought one event meant transferring one log line. @steffens could you please help clarify this? Thanks!

Where do you read the metrics from? Added is indeed a monotonically increasing counter in the code, but when reporting metrics via the log output, Beats reports most metrics as deltas since the last report. This means you likely see the derivative (events per 30s).

Normally I look at libbeat.pipeline and libbeat.output metrics.

Small trick: by default filebeat first buffers events and then flushes the buffer (by default every 1s). This is used to build up batches, but it potentially adds some latency if the output can process events faster than they trickle in. Depending on the machine setup, users have reported lower latency (and sometimes even higher throughput) when setting queue.mem.flush.min_events: 0 (see: Configuring the internal queue). This also adds some more work to the pipeline, though, potentially increasing CPU usage without much gain... experience varies and I haven't gotten consistent feedback from users. Some reported higher throughput, some reported less (it depends on machine, output, network, overall machine load, ...).
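For example, a hedged sketch of that queue tweak (the values other than min_events are the 6.x defaults, shown only for context):

    queue.mem:
      events: 4096          # total queue capacity (default)
      flush.min_events: 0   # forward events as soon as they arrive, no batching delay
      flush.timeout: 1s     # default flush interval; largely irrelevant once min_events is 0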
When benchmarking with big inputs you should measure CPU, memory, and throughput, so as to do some more tuning.

  1. It seems like if we have a small number of large log files, the # of events shown will be the number of log files. I tried 1, 3, and 7 log files.
    However, if we have 100+ log files, the # of events, filebeat.events.added and libbeat.pipeline.events.published, can be as high as 200K -- 600K over time; the oscillation may be attributed to some log lines being short and others long.
    Typically filebeat.events.added is similar to or the same as libbeat.pipeline.events.published.
    I am puzzled by this, since in the first case (small # of log files), libbeat.pipeline.events.published = number of log files; and in the case of a large number of log files, libbeat.pipeline.events.published seems equal to the # of log lines transferred.
    I understand that the event counter is a monotonically increasing counter; however, I am still confused about the definition of the event reported in the log.

  2. Which of the following config syntaxes is correct? Or are they the same?
    registry.flush: 5s
    registry.flush: 5

Typically filebeat.events.added is similar to or the same as libbeat.pipeline.events.published.

That's about right. The latter is used for monitoring; the former is rather a means for us to debug filebeat being in sync with the publisher pipeline. Filebeat itself also counts the number of events, which is used for the shutdown_timeout setting. Depending on when they are counted and when the metrics snapshot is taken, one might be a little higher than the other.
If you start filtering out events, e.g. via the drop_event processor, then you might also get libbeat.pipeline.events.filtered. In this case filebeat.events.added is about the sum of events.published and events.filtered. There should also be a libbeat.pipeline.events.total.

I am puzzled by this, since in the first case (small # of log files), libbeat.pipeline.events.published = number of log files; and in the case of a large number of log files, libbeat.pipeline.events.published seems equal to the # of log lines transferred.

The counter is on a per-event basis.

In which state was your registry file when starting up filebeat? Did you have a clean one or an old one? Was filebeat publishing events to the console in this scenario?

Filebeat also sends internal events, which are never published, to the registry file. These events need to pass through the publisher pipeline as well, so as to guarantee ordering between 'file finished' events (or similar) and normal event publishing, which updates the file offset in the registry file only after an ACK from the outputs. But I'm not exactly sure if/how/when they are accounted for in any of the metrics.
On startup the registry presents all known states to all the configured inputs. The inputs check if the files still exist, start collecting them from the old offset, and send state updates to the registry, allowing the registry to update paths (if a file has been renamed) or add/remove entries on startup (e.g. if inputs have been removed/added).

Maybe the registry did exist in your case and was already pointing to the end of the files?

Which of the following config syntaxes is correct? Or are they the same?
registry.flush: 5s
registry.flush: 5

The setting accepts a unit like ns, us, ms, s, m, or h. If no unit is given, it defaults to seconds. In general I prefer to write duration-based settings with units.
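In other words (a small illustration; both forms parse to the same duration):

    registry.flush: 5s    # explicit unit, preferred for clarity
    #registry.flush: 5    # bare number, interpreted as seconds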


(1) Sorry, I should have stated that: before each experiment (with each config change), I specifically cleaned up the "path.data" directory, i.e., I removed both meta.json and the registry. Is there another registry file hidden somewhere?
My filebeat command line is:
bin/filebeat -path.config genConfig -path.data data/beats/ -path.logs logs > /dev/null
(2) Yes, filebeat always publishes events to the console.
output.console:
  pretty: false

Your setup looks OK. Running with -d '*,config' should print your config to the logs; you can verify the paths used and that the config is correct from the log output.

I prefer to run filebeat with -c <path>/<to>/<config> when testing/benchmarking. For this I normally create a bunch of separate config files, one for inputs, one for the output, one for queue settings, and then I run my scenarios by combining the relevant config files (this makes it easy to test the same scenario with LS/ES/Kafka while tuning those services):

$ rm -fR data; filebeat -httpprof :6060 -e -v -c input_small.yml -c queue.yml -c console.yml | pv -Warl > /dev/null

Also note the use of -httpprof :6060. This creates an HTTP endpoint for profiling and live collection of metrics. I use scripts to collect metrics every second and display the current and (moving) average rate of some queue metrics. You could also run metricbeat against this filebeat and index the benchmarking metrics into Elasticsearch (or just enable x-pack monitoring and have the same data written to Elasticsearch).
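For instance, a quick way to sample those live metrics (-httpprof serves Go's standard expvar endpoint, so the URL below should work; adjust host/port to your setup):

    # print the metrics snapshot once per second
    $ watch -n 1 'curl -s http://localhost:6060/debug/vars'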

I prefer to log directly to console while testing. This works as events are written to stdout, and the logs are written to stderr.

Can you reliably reproduce the issue? If so, could you get us the scripts/configs for testing?


Sure, I can try to repro. We do not use Elasticsearch; we only use filebeat to send logs to Kafka, and I output to the console in my benchmarking.
I got some unintuitive results: in the production environment, a flush timeout of 5s (or 10s) has lower throughput than a flush timeout of 1s.
On my dev box, a flush timeout of 5s sometimes has higher (sometimes lower) throughput than a flush timeout of 1s.
I used the same config on both my dev box and the production machine; the difference is in the log files to be published. I created a few super large log files on my dev box, each of which has one to a few million log lines. Each log line would generate one libbeat event published?
Any conjecture on why a 1s flush timeout produces higher throughput than a flush timeout of 5s? I thought a longer flush timeout means fewer writes to the registry, which in turn means higher throughput?

  • The default flush timeout (which is 0?) is consistently worse than a flush timeout of 1s.

I thought a longer flush timeout means fewer writes to the registry, which in turn means higher throughput?

Well, there is no linear dependency between flush timeout and throughput. Once the flush timeout is big enough that the registry can continue processing new events as they become available, the registry does not add any noticeable back-pressure anymore. Still, if the disk is under load, a higher timeout can 'save' you some time. The drawback of a higher timeout: if filebeat or the machine crashes, we have to send duplicates.
To me it sounds like 1s is already a good parameter. The differences in throughput (higher/lower) might have other causes.

Processing in filebeat is highly asynchronous, and therefore requires some communication and coordination on shared resources. Maybe with 1s you still have a little back-pressure, which causes some subsystems to block, which potentially reduces contention on other shared structures. Kind of a domino effect :slight_smile: But this is just guesswork I can't confirm from 'remote debugging'.

In production (when sending to a remote service), changes in timing and throughput within beats can also affect the network. If you attempt to send more data, you transmit more packets over the network. Depending on the infrastructure (e.g. buffering capabilities), there is a chance of packet drops in the network once you hit a tipping point.

On another note: how many cores does your machine have? And how many OS threads are busy? At times the Go runtime can go bonkers if the ratio between available cores and busy OS threads is large. This shows up as unnecessarily high CPU usage, contention on the scheduler, and sometimes less overall performance/throughput. If you only use up to 100% CPU (one core only), then restrict filebeat to 1 or 2 cores (max_procs: 1 in your config file).
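That is (a one-line illustration of the setting just mentioned):

    # cap the number of cores the Go scheduler may use
    max_procs: 1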

I created a few super large log files on my dev box, each of which has one to a few million log lines. Each log line would generate one libbeat event published?

correct.


There are 12 cores on my dev box and 24 cores on my production box where I do perf tuning. On both boxes I allocated the last 3 cores of the machine to filebeat and pinned filebeat to those 3 cores, i.e., I set:
echo $GOMAXPROCS
3
filebeat.max_procs: 3
my command line:
sudo -u app taskset -c 21,22,23 bin/filebeat -path.config genConfig -path.data /export/content/data/in-beats/i001 -path.logs logs
The production box I used for the perf experiments is otherwise lightly loaded. With 3 allocated cores, filebeat CPU usage is ~200%; it seems like filebeat only gets to use ~2 cores. "mpstat -P ALL 1" shows each of those 3 cores still has ~30% unused CPU (out of the 300% total available).
I was surprised to find out that filebeat created 62 OS threads with $GOMAXPROCS = 3 (on a machine with only 24 cores).
ps -T -p 14759
PID SPID TTY TIME CMD
14759 14759 pts/7 00:02:15 filebeat
14759 14760 pts/7 00:00:00 filebeat
14759 14761 pts/7 00:00:00 filebeat
14759 14762 pts/7 00:00:00 filebeat
14759 14763 pts/7 00:00:00 filebeat
14759 14764 pts/7 00:00:00 filebeat
14759 14765 pts/7 00:00:00 filebeat
14759 14766 pts/7 00:00:00 filebeat
14759 14767 pts/7 00:00:00 filebeat
14759 14768 pts/7 00:03:04 filebeat
14759 14769 pts/7 00:00:00 filebeat
14759 14770 pts/7 00:00:00 filebeat
14759 14771 pts/7 00:00:00 filebeat
14759 14772 pts/7 00:00:00 filebeat
14759 14773 pts/7 00:00:00 filebeat
14759 14774 pts/7 00:00:00 filebeat
14759 14775 pts/7 00:00:00 filebeat
14759 14776 pts/7 00:00:00 filebeat
14759 14777 pts/7 00:00:00 filebeat
14759 14778 pts/7 00:00:00 filebeat
14759 14779 pts/7 00:02:42 filebeat
14759 14780 pts/7 00:02:36 filebeat
14759 14781 pts/7 00:00:00 filebeat
14759 14782 pts/7 00:00:00 filebeat
14759 14783 pts/7 00:00:00 filebeat
14759 14784 pts/7 00:00:00 filebeat
14759 14785 pts/7 00:00:00 filebeat
14759 14786 pts/7 00:00:00 filebeat
14759 14787 pts/7 00:00:00 filebeat
14759 14788 pts/7 00:00:00 filebeat
14759 14789 pts/7 00:00:00 filebeat
14759 14790 pts/7 00:00:00 filebeat
14759 14791 pts/7 00:00:00 filebeat
14759 14792 pts/7 00:00:00 filebeat
14759 14793 pts/7 00:00:00 filebeat
14759 14794 pts/7 00:00:00 filebeat
14759 14795 pts/7 00:00:00 filebeat
14759 14796 pts/7 00:00:00 filebeat
14759 14797 pts/7 00:00:00 filebeat
14759 14798 pts/7 00:00:00 filebeat
14759 14799 pts/7 00:00:00 filebeat
14759 14800 pts/7 00:00:00 filebeat
14759 14801 pts/7 00:00:00 filebeat
14759 14802 pts/7 00:00:00 filebeat
14759 14803 pts/7 00:00:00 filebeat
14759 14804 pts/7 00:00:00 filebeat
14759 14805 pts/7 00:00:00 filebeat
14759 14806 pts/7 00:00:00 filebeat
14759 14807 pts/7 00:02:23 filebeat
14759 14808 pts/7 00:00:00 filebeat
14759 14809 pts/7 00:00:00 filebeat
14759 14810 pts/7 00:00:00 filebeat
14759 14811 pts/7 00:00:00 filebeat
14759 14812 pts/7 00:00:00 filebeat
14759 14813 pts/7 00:00:00 filebeat
14759 14814 pts/7 00:00:00 filebeat
14759 14815 pts/7 00:00:00 filebeat
14759 14817 pts/7 00:02:43 filebeat
14759 14818 pts/7 00:01:12 filebeat
14759 14819 pts/7 00:00:00 filebeat
14759 14820 pts/7 00:00:00 filebeat

I could not drive filebeat CPU usage up to 300% with any of the config values (0, 1, 5, 10). I am wondering where the bottleneck is. I plan to tune the queue parameters with the flush timeout set to 1s.
(1) I observed increased throughput (without driving up CPU usage, though) when setting bulk_max_size (in the output section) to 4096. If I understand it correctly, the default value is 2048.
(2) Setting worker: 2 did not provide a meaningful throughput increase.
(3) My optimal config set for output to /dev/null seems to be:
registry.flush: 1
bulk_max_size: 4096
worker: 1 (even though I allocated 3 cores to filebeat)
My question is: would these config values still apply once I change the output to Kafka? Or do I have to benchmark all over again? That would be a pain :(

I was surprised to find out that filebeat created 62 OS threads with $GOMAXPROCS = 3 (on a machine with only 24 cores).

Most likely due to blocking IO. When it has to make blocking calls, the Go runtime creates another OS thread to keep running goroutines. Once the blocking call returns, OS threads might time out and be removed. Using htop you can see the OS threads and their CPU usage live.

I don't think bulk_max_size is supported by the console output.

My question is: would these config values still apply once I change the output to Kafka? Or do I have to benchmark all over again? That would be a pain :(

Right, you are starting to optimize the output here. If you change the output, things might look very different. With the Kafka output, you will perceive back-pressure from Kafka and the network directly.

(1) I observed increased throughput (without driving up CPU usage, though) when setting bulk_max_size (in the output section) to 4096. If I understand it correctly, the default value is 2048.

Depends on the output. But when modifying bulk_max_size, also take the queue size into account. The queue should be bigger than bulk_max_size * worker: events are split up into batches, and you can have at most B = queue size / bulk_max_size batches in the pipeline. You want B > worker. In the case of the Kafka output it doesn't make sense to increase workers, as the Kafka output load-balances events across the different brokers, which means that one slow broker can stall filebeat (the slowest filebeat-to-broker connection dictates overall throughput).
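To make the sizing rule concrete, a hedged sketch with illustrative numbers (the broker address and topic are hypothetical placeholders):

    # B = queue.mem.events / bulk_max_size = 16384 / 4096 = 4 batches in flight,
    # which stays above the number of output workers
    queue.mem:
      events: 16384
    output.kafka:
      hosts: ["kafka:9092"]   # hypothetical broker
      topic: "logs"           # hypothetical topic
      bulk_max_size: 4096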


Can you elaborate on this one? Why would a single worker lead to higher throughput?
I thought 1 worker would make it worse, since everything would be stalled on a single slow connection.

Thanks for the tips on the queue size! Which of the following is the queue size? Is it queue.mem.events? Does flush.min_events need to be >= bulk_max_size?
queue.mem:
  events: 4096
  flush.min_events: 512
  flush.timeout: 5s