Tuning filebeat performance: why can I not drive CPU usage close to 100%?

Hello,
I tried to tune filebeat performance. As @steffens suggested in one post, I changed the output to console. One thing I find strange is: I allocated 3 cores to filebeat; however, I can only drive the CPU usage to two cores at 50% and one core at 20%. We are using filebeat 6.1.3.
Why can I not drive CPU usage close to 100%? What might be the bottleneck? Any tips on debugging the bottleneck would be appreciated.
The log stats are:
2019-07-10T18:11:08-07:00 INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30038 beat.memstats.gc_next=59367392 beat.memstats.memory_alloc=54714200 beat.memstats.memory_total=12811647016 filebeat.events.active=1247 filebeat.events.added=180043 filebeat.events.done=178796 filebeat.harvester.open_files=3 filebeat.harvester.running=3 filebeat.harvester.started=3 libbeat.config.module.running=0 libbeat.output.events.acked=178790 libbeat.output.events.batches=88 libbeat.output.events.total=178790 libbeat.output.type=console libbeat.output.write.bytes=74209763 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=1247 libbeat.pipeline.events.filtered=6 libbeat.pipeline.events.published=180037 libbeat.pipeline.events.total=180043 libbeat.pipeline.queue.acked=178790 registrar.states.current=6 registrar.states.update=178796 registrar.writes=94
2019-07-10T18:11:38-07:00 INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=29985 beat.memstats.gc_next=43650096 beat.memstats.memory_alloc=45868008 beat.memstats.memory_total=14817652576 filebeat.events.active=-1247 filebeat.events.added=174213 filebeat.events.done=175460 filebeat.harvester.open_files=3 filebeat.harvester.running=3 libbeat.config.module.running=0 libbeat.output.events.acked=175460 libbeat.output.events.batches=86 libbeat.output.events.total=175460 libbeat.output.write.bytes=73454908 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=0 libbeat.pipeline.events.published=174213 libbeat.pipeline.events.total=174213 libbeat.pipeline.queue.acked=175460 registrar.states.current=6 registrar.states.update=175460 registrar.writes=86

Events added is roughly the same as events done, so why can filebeat not add events faster?
Filebeat YAML file:

   #=========================== Filebeat prospectors =============================                                                                                                      
    filebeat.prospectors:

    # Each - is a prospector. Most options can be set at the prospector level, so
    # you can use different prospectors for various configurations.
    # Below are the prospector specific configurations.

    - input_type: log 

      # Paths that should be crawled and fetched. Glob based paths.
      paths:
        - /export/content/lid/apps/watchmen-server/i001/logs/*

      #  closes the harvester/fd when a file is removed.
      close_removed: true

      # filebeat will focus on sending the current log files
      ignore_older: 60m 
      close_inactive: 5m

      # set max lifetime of each harvester (fd)
      close_timeout: 70m 
      #after clean, the file content will be sent again from the beginning if the same file name re-appear
      #for sparse logs, we do not mind dup in the kafka, kafka will discard old logs > 5 days
      clean_inactive: 48h 
      clean_removed: true
      symlinks: true

      ### Multiline options

      # Multiline can be used for log messages spanning multiple lines. This is common
      # for Java Stack Traces or C-Line Continuation

      # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
      multiline.pattern: '^[[:digit:]]{4,}[[:punct:]][[:digit:]]{2,}|^[I|W|E|F|D|T][[:digit:]]{4,}'

      # Defines if the pattern set under pattern should be negated or not. Default is false.
      multiline.negate: true

      # Match can be set to "after" or "before". It is used to define if lines should be appended to a pattern
      # that was (not) matched before or after, or as long as a pattern is not matched based on negate.
      # Note: After is the equivalent to previous and before is the equivalent to next in Logstash
      multiline.match: after
    #================================ processor =====================================

    processors:
     - drop_fields:
         fields: ["beat.name", "beat.version", "input_type", "@timestamp", "@metadata", "metadata.type", "type", "offset"]

    #-------------------------- Console output ------------------------------
    output.console:
      pretty: false

Please format logs and config files using the </> button.

Filebeat can read logs quite fast, but then one might experience back-pressure in filebeat, which in turn slows down the reading. Back-pressure normally comes from the output and from writing the registry file. With the console output you are still subject to back-pressure, especially if you write to another process or to the console (even with /dev/null you rely on syscalls and the OS to throw away the contents).
When writing to /dev/null (and in scenarios with very many files), the source of back-pressure is normally the registry file writing.

As of now the registry is an in-memory store tracking states. By default the store gets dumped to disk after every ACK from the outputs, which means write + fsync operations. Disk operations can produce back-pressure here. Depending on the disk, configuring the registry flush timeout to 5s, to reduce the number of IO and sync operations, often helps.
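A minimal sketch of that change, assuming the filebeat.registry.flush key used by newer filebeat releases (the exact key path differs between versions):

    # batch registry writes instead of flushing + fsyncing after every ACK
    filebeat.registry.flush: 5s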


BTW, changing the above parameter helped us drive up CPU usage, thanks for the information.
I am curious about what "event" in filebeat.events.added refers to.
Once I increased the registry flush timeout to 5s, filebeat.events.added became much smaller.
I originally thought one event means transferring one log line. @steffens could you please help clarify this? Thanks!

Where do you read the metrics from? Added is indeed a monotonically increasing counter in code, but when reporting metrics using the log output, Beats reports most metrics as deltas since the last report. Which means that you likely see the derivative (events per 30s).

Normally I look at libbeat.pipeline and libbeat.output metrics.

Small trick: by default filebeat first buffers events and then flushes the buffer (by default every 1s). This is used to build up batches, but potentially adds some latency if the output can process events faster than they trickle in. Depending on the machine setup, users have reported lower latency (and sometimes even higher throughput) when setting queue.mem.flush.min_events: 0 (See: Configuring the internal queue). This also adds some more work to the pipeline though, potentially increasing CPU usage without much gain... experience varies and I haven't gotten consistent feedback from users. Some reported higher throughput, some reported less throughput (depends on machine, output, network, overall machine load, ...).
When benchmarking with big inputs you should measure CPU, memory, and throughput, so you can do some more tuning.
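A hedged sketch of the queue settings mentioned above (the values are only illustrative):

    queue.mem:
      events: 4096          # total queue capacity
      flush.min_events: 0   # hand events to the output immediately instead of buffering them into batches
      flush.timeout: 1s     # the default 1s flush interval mentioned above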

  1. It seems like if we have a small number of large log files, the # of events shown will be the number of log files. I tried 1, 3, and 7 log files.
    However, if we have 100+ log files, the # of events, filebeat.events.added and libbeat.pipeline.events.published, can be as high as 200K to 600K over time; the oscillation may be attributed to some log lines being short and others long.
    Typically filebeat.events.added is similar to or the same as libbeat.pipeline.events.published.
    I am puzzled by this, since in the first case (small # of log files) libbeat.pipeline.events.published = number of log files, while in the case of a large number of log files libbeat.pipeline.events.published seems equal to the # of log lines transferred.
    I understand that the event counter is a monotonically increasing counter; however, I am still confused about the definition of the event reported in the log.

  2. Which of the following config syntaxes is correct? Or are they the same?
    registry.flush: 5s
    registry.flush: 5

Typically filebeat.events.added is similar to or the same as libbeat.pipeline.events.published.

That's about right. The latter is used for monitoring; the former is rather a means for us to debug whether filebeat is in sync with the publisher pipeline. Filebeat itself also counts the number of events, which is used for the shutdown_timeout setting. Depending on when they are counted and when the metrics snapshot is taken, one might be a little higher than the other.
If you start filtering out events, e.g. via the drop_event processor, then you might also get libbeat.pipeline.events.filtered. In this case filebeat.events.added is about the sum of events.published and events.filtered. There should also be a libbeat.pipeline.events.total.

I am puzzled by this, since in the first case (small # of log files) libbeat.pipeline.events.published = number of log files, while in the case of a large number of log files libbeat.pipeline.events.published seems equal to the # of log lines transferred.

The counter is on a per-event basis.

In which state was your registry file when starting up filebeat? Did you have a clean one or an old one? Was filebeat publishing events to the console in this scenario?

Filebeat also sends internal events, which are not meant to be published, to the registry file. These events need to pass through the publisher pipeline as well, so as to guarantee ordering between 'file finished' events (or similar) and normal event publishing, which updates the file offset in the registry file only after the ACK from the outputs. But I'm not exactly sure if/how/when they are accounted for in any of the metrics.
On startup the registry presents all known states to all the configured inputs. The inputs check if the files still exist, start collecting them from the old offset, and send state updates to the registry, allowing the registry to update paths (if a file has been renamed) or add/remove entries on startup (e.g. if inputs have been removed/added).

Maybe the registry did exist in your case and was already pointing to the end of the files?

Which of the following config syntaxes is correct? Or are they the same?
registry.flush: 5s
registry.flush: 5

The setting accepts a unit like ns, us, ms, s, m, or h. If no unit is given, then it defaults to seconds. In general I prefer to specify duration-based settings with units.


(1) Sorry, I should have stated that before each experiment (with each config change) I specifically cleaned up the "path.data" directory, i.e., I removed both meta.json and registry. Is there another registry file hidden somewhere?
My filebeat command line is:
bin/filebeat -path.config genConfig -path.data data/beats/ -path.logs logs > /dev/null
(2) Yes, filebeat always publishes events to the console.
output.console:
pretty: false

Your setup looks OK. Running with -d '*,config' should print your config to the logs. You can verify the paths used and whether the config is correct from the log output.

I prefer to run filebeat with -c <path>/<to>/<config> when testing/benchmarking. For this I normally create a bunch of separate config files, one for inputs, one for the output, one for queue settings, and then I run my scenarios by combining the relevant config files (this makes it easy to test the same scenario with LS/ES/Kafka while tuning those services):

$ rm -fR data; filebeat -httpprof :6060 -e -v -c input_small.yml -c queue.yml -c console.yml | pv -Warl > /dev/null

Also note the use of -httpprof :6060. This creates an HTTP endpoint for profiling and live collection of metrics. I use scripts to collect metrics every second and display the current and (moving) average rate of some queue metrics. You could also run metricbeat against this filebeat and index the benchmarking metrics into Elasticsearch (or just enable x-pack monitoring and have the same data written to Elasticsearch).

I prefer to log directly to console while testing. This works as events are written to stdout, and the logs are written to stderr.

Can you reliably reproduce the issue? If so, could you get us the scripts/configs for testing?


Sure, I can try to repro. We do not use Elasticsearch; we only use filebeat to send logs to Kafka, and I output to the console in my benchmarking.
I got some unintuitive results: in the production environment, a flush timeout of 5s (or 10s) has lower throughput than a flush timeout of 1s.
On my dev box, a flush timeout of 5s sometimes has higher (sometimes lower) throughput than a flush timeout of 1s.
I used the same config on both my dev box and the production machine; the difference is in the log files to be published. I created a few super large log files on my dev box, each of which has one to a few million log lines. Each log line would generate one libbeat event published?
Any conjecture on why a 1s flush timeout produces higher throughput than a flush timeout of 5s? I thought a longer flush timeout means fewer writes to the registry, which in turn means higher throughput?

  • The default flush timeout (which is 0?) is consistently worse than a flush timeout of 1s.

I thought a longer flush timeout means fewer writes to the registry, which in turn means higher throughput?

Well, there is no linear dependency between flush timeout and throughput. Once the flush timeout is big enough that the registry can continue processing new events once they become available, the registry does not add any noticeable back-pressure anymore. Still, if the disk is under load, a higher timeout can 'save' you some time. The drawback of a higher timeout is: if filebeat or the machine crashes, then we have to send duplicates.
To me it sounds like 1s is already a good parameter. The differences in throughput (higher/lower) might be based on other causes.

Processing in filebeat is highly asynchronous, and therefore requires some communication and coordination on shared resources. Maybe with 1s you still have a little back-pressure, which causes some subsystems to block, which potentially reduces contention on other shared structures. Kind of a domino effect :slight_smile: But this is just guesswork I can't confirm through 'remote debugging'.

In production (when sending to a remote service), changes in timing and throughput within beats could also affect the network. If you attempt to send more data, then you're transmitting more packets over the network. Depending on the infrastructure (e.g. buffering capabilities) there is a chance of packet drops in the network once you cross a certain threshold.

On another note: How many cores does your machine have? And how many OS threads are busy? At times the Go runtime can go bonkers if the ratio between available cores and busy OS threads is large. This shows up as unnecessarily high CPU usage, contention on the scheduler, and sometimes less overall performance/throughput. If you only use up to 100% CPU (one core only), then set the number of cores available to filebeat to only 1 or 2 (max_procs: 1 in your config file).
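A minimal sketch of that setting in filebeat.yml (the value here is only an illustration):

    # limit the number of cores the Go runtime may use for filebeat
    max_procs: 2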

I created a few super large log files on my dev box, each of which has one to a few million log lines. Each log line would generate one libbeat event published?

correct.


There are 12 cores on my dev box and 24 cores on my production box where I do perf tuning. On both the dev box and the prod box, I allocated the last 3 cores of the machine to filebeat and pinned filebeat to those 3 cores, i.e., I set:
echo $GOMAXPROCS
3
filebeat.max_procs: 3
My command line:
sudo -u app taskset -c 21,22,23 bin/filebeat -path.config genConfig -path.data /export/content/data/in-beats/i001 -path.logs logs
The production box I used for the perf experiments is otherwise lightly loaded. With 3 allocated cores, filebeat CPU usage is ~200%; it seems like filebeat only gets to use ~2 cores. "mpstat -P ALL 1" shows each of those 3 cores still has ~30% unused CPU, so the available capacity adds up to 300%.
I was surprised to find out that filebeat created 62 OS threads with $GOMAXPROCS = 3 (only 24 cores on this machine).
ps -T -p 14759
PID SPID TTY TIME CMD
14759 14759 pts/7 00:02:15 filebeat
14759 14760 pts/7 00:00:00 filebeat
14759 14761 pts/7 00:00:00 filebeat
14759 14762 pts/7 00:00:00 filebeat
14759 14763 pts/7 00:00:00 filebeat
14759 14764 pts/7 00:00:00 filebeat
14759 14765 pts/7 00:00:00 filebeat
14759 14766 pts/7 00:00:00 filebeat
14759 14767 pts/7 00:00:00 filebeat
14759 14768 pts/7 00:03:04 filebeat
14759 14769 pts/7 00:00:00 filebeat
14759 14770 pts/7 00:00:00 filebeat
14759 14771 pts/7 00:00:00 filebeat
14759 14772 pts/7 00:00:00 filebeat
14759 14773 pts/7 00:00:00 filebeat
14759 14774 pts/7 00:00:00 filebeat
14759 14775 pts/7 00:00:00 filebeat
14759 14776 pts/7 00:00:00 filebeat
14759 14777 pts/7 00:00:00 filebeat
14759 14778 pts/7 00:00:00 filebeat
14759 14779 pts/7 00:02:42 filebeat
14759 14780 pts/7 00:02:36 filebeat
14759 14781 pts/7 00:00:00 filebeat
14759 14782 pts/7 00:00:00 filebeat
14759 14783 pts/7 00:00:00 filebeat
14759 14784 pts/7 00:00:00 filebeat
14759 14785 pts/7 00:00:00 filebeat
14759 14786 pts/7 00:00:00 filebeat
14759 14787 pts/7 00:00:00 filebeat
14759 14788 pts/7 00:00:00 filebeat
14759 14789 pts/7 00:00:00 filebeat
14759 14790 pts/7 00:00:00 filebeat
14759 14791 pts/7 00:00:00 filebeat
14759 14792 pts/7 00:00:00 filebeat
14759 14793 pts/7 00:00:00 filebeat
14759 14794 pts/7 00:00:00 filebeat
14759 14795 pts/7 00:00:00 filebeat
14759 14796 pts/7 00:00:00 filebeat
14759 14797 pts/7 00:00:00 filebeat
14759 14798 pts/7 00:00:00 filebeat
14759 14799 pts/7 00:00:00 filebeat
14759 14800 pts/7 00:00:00 filebeat
14759 14801 pts/7 00:00:00 filebeat
14759 14802 pts/7 00:00:00 filebeat
14759 14803 pts/7 00:00:00 filebeat
14759 14804 pts/7 00:00:00 filebeat
14759 14805 pts/7 00:00:00 filebeat
14759 14806 pts/7 00:00:00 filebeat
14759 14807 pts/7 00:02:23 filebeat
14759 14808 pts/7 00:00:00 filebeat
14759 14809 pts/7 00:00:00 filebeat
14759 14810 pts/7 00:00:00 filebeat
14759 14811 pts/7 00:00:00 filebeat
14759 14812 pts/7 00:00:00 filebeat
14759 14813 pts/7 00:00:00 filebeat
14759 14814 pts/7 00:00:00 filebeat
14759 14815 pts/7 00:00:00 filebeat
14759 14817 pts/7 00:02:43 filebeat
14759 14818 pts/7 00:01:12 filebeat
14759 14819 pts/7 00:00:00 filebeat
14759 14820 pts/7 00:00:00 filebeat

I could not drive filebeat CPU usage up to 300% with any of the config values (0, 1, 5, 10). I am wondering where the bottleneck is. I plan to tune queue parameters with the flush timeout set to 1.
(1) I observed increased throughput (without driving up CPU usage, though) when setting bulk_max_size (in the output section) to 4096. If I understand it correctly, the default value is 2048.
(2) Setting worker: 2 did not provide a meaningful throughput increase.
(3) My optimal config set for output to /dev/null seems to be:
registry.flush: 1
bulk_max_size: 4096
worker: 1 (even though I allocated 3 cores to filebeat).
My question is: would these config values still apply once I change the output to Kafka?
Or do I have to benchmark all over again? That would be a pain :(

I was surprised to find out that filebeat created 62 OS threads with $GOMAXPROCS = 3 (only 24 cores on this machine).

Most likely due to blocking IO. When having to do blocking calls, the Go runtime creates another OS thread to handle goroutines. Once the blocking call returns, OS threads might time out and be removed. Using htop you can see the OS threads and their CPU usage live.

I don't think bulk_max_size is supported by the console output.

My question is: would these config values still apply once I change the output to Kafka?
Or do I have to benchmark all over again? That would be a pain :(

Right, you start optimizing the output here. If you change the output, then things might look very different. With the kafka output, you will perceive back-pressure from Kafka and the network directly.

(1) I observed increased throughput (without driving up CPU usage, though) when setting bulk_max_size (in the output section) to 4096. If I understand it correctly, the default value is 2048.

Depends on the output. But when modifying bulk_max_size, also take the queue size into account. The queue should be bigger than bulk_max_size * worker, as events are split up into batches and you can have at most B = queue size / bulk_max_size batches in the pipeline. You want B > worker. In the case of the kafka output it doesn't make sense to increase workers, as the kafka output load balances events to the different brokers. Which means that one slow broker can stall filebeat (the slowest filebeat->broker connection dictates overall throughput).
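To illustrate the sizing rule, a rough sketch (the values, broker, and topic below are placeholders, not recommendations):

    queue.mem:
      events: 16384              # queue size
      flush.min_events: 2048
      flush.timeout: 1s
    output.kafka:
      hosts: ["broker-1:9092"]   # placeholder broker
      topic: "logs"              # placeholder topic
      worker: 1
      bulk_max_size: 2048        # B = 16384 / 2048 = 8 batches, so B > worker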


Can you elaborate on this one? Why would a single worker lead to higher throughput?
I thought 1 worker would make it worse, since everything would be stalled on this single slow connection.

Thanks for the tips on the queue size! Which of the following is the queue size? Is it queue.mem.events? Does flush.min_events need to be >= bulk_max_size?
queue.mem:
events: 4096
flush.min_events: 512
flush.timeout: 5s

Kafka is different from all other outputs. In the case of kafka, the library we use creates a connection to each broker that holds an active partition for the topics you publish to. The kafka client library then does the load balancing. With kafka in the chain you do not just connect to a broker that you configured in the list of hosts; the output bootstraps connections to all brokers in the kafka cluster by querying the configured broker for the cluster metadata. The amount of load balancing you get depends on the kafka cluster itself, namely the number of brokers and topics.
With kafka, much of the logic for sending and balancing data lives in the client, not the cluster.
The way the client is used is already asynchronous, meaning beats continue producing events (until the queues are full) while the client asynchronously delivers already existing/buffered events.
Increasing workers to 2 will instantiate 2 kafka clients, each connecting to the same cluster and publishing to the same topics/partitions. This would introduce some contention. But as there is already some load balancing in place, the gain from load balancing the load balancing might be negligible. You can still try it, but I don't think it improves throughput by much.

Due to the 'cluster' mode in which kafka clients and brokers interoperate, it is normally assumed/required that clients and brokers are "nearby" within your network. Intermediaries like firewalls acting as a single point of contact between separate networks can indeed impact throughput heavily. All in all, throughput can be very sensitive to your network structure. In some cases reducing the number of partitions, and thereby the amount of load balancing, can even help with throughput.

queue.mem.events is the queue size. If queue.mem.flush.min_events is reached, a set of events becomes available to the output. The number of events the output processes at once is between queue.mem.flush.min_events and output.kafka.bulk_max_size, e.g. if the output is busy, then a batch can grow up to output.kafka.bulk_max_size.
Due to the asynchronous nature of the kafka output, it can be beneficial to have smaller batches prepared. But this is a micro-optimization one needs to test to see if it really makes a difference.
If throughput is large enough, then the batch size itself might not make much of a difference at all. As long as you have enough batches in the system to keep the pipelines busy you should be fine.

There are many factors impacting overall throughput, and each network/setup is different. The best tip I can give is: measure, measure, measure. Document changes and measurements. If a parameter makes a difference, try to find the sweet spot. After modifying other variables, revisit some you modified in the past.
Tuning/optimization is a global optimization problem, where a set of variables across a set of sub-systems must be optimized together. Yet you can normally modify only a small set of variables/subsystems at once.

Tuning/changing parameters in Kafka (e.g. the number of partitions) might even give you bigger gains than you get from tuning filebeat itself. We use the console output in order to get an idea of what the machine is capable of, like an upper bound of what we can potentially collect. When introducing Kafka, you will get less throughput for sure.


BTW, when setting the output to console, I got a set of optimal config parameters:
registry.flush=1s, queue size = 65536, queue.mem.flush.min_events = 65536, output.bulk_max_size=65536. BTW, when upgrading filebeat from 6.1.3 to 7.2, we observed a big improvement in throughput; it reports ~1M events (log lines) per 30s interval. Thanks for the improvements!
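For reference, these values would roughly map onto filebeat.yml keys as follows (the key paths are an assumption; as noted earlier in the thread, bulk_max_size may not be honored by the console output):

    filebeat.registry.flush: 1s
    queue.mem:
      events: 65536
      flush.min_events: 65536
    output.console:
      pretty: false
      bulk_max_size: 65536   # possibly ignored by the console output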

However, when I changed the output to Kafka, the throughput plummeted. Granted, I need to re-tune these configs, but still, per the logs (copied below) there are lots of retries and failures. @steffens, thanks for your previous super helpful tips; I am wondering whether you have any quick insights on (1) why there are so many fails and retries, and why throughput drops from 1M to 65K?
(2) What does "failed" in the filebeat log refer to? Failed to send to Kafka?
BTW, it does drive CPU to 300% (we allocate 3 CPUs to filebeat); however, it seems to be busy while doing little work.

2019-07-25T00:46:14.125Z|INFO|[monitoring]|log/log.go:145|Non-zero metrics in the last 30s|{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":5910,"time":{"ms":5912}},"total":{"ticks":87620,"time":{"ms":87628},"value":87620},"user":{"ticks":81710,"time":{"ms":81716}}},"handles":{"limit":{"hard":30000,"soft":30000},"open":258},"info":{"ephemeral_id":"05ec7c14-8e94-4c8e-986a-98e48995a95b","uptime":{"ms":30171}},"memstats":{"gc_next":1560903344,"memory_alloc":1574909320,"memory_total":43295363944,"rss":1811226624},"runtime":{"goroutines":1703}},"filebeat":{"events":{"active":65946,"added":65949,"done":3},"harvester":{"open_files":206,"running":206,"started":206}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"active":47037,"batches":40,"failed":851968,"total":899005},"type":"kafka"},"outputs":{"kafka":{"bytes_read":2986675,"bytes_write":228037934}},"pipeline":{"clients":1,"events":{"active":65557,"filtered":206,"published":65556,"retry":880506,"total":65763}}},"registrar":{"states":{"current":3,"update":3},"writes":{"success":4,"total":4}},"system":{"cpu":{"cores":3},"load":{"1":6.34,"15":8.41,"5":10.2,"norm":{"1":2.1133,"15":2.8033,"5":3.4}}}}}}

2019-07-25T00:46:44.034Z|INFO|[monitoring]|log/log.go:145|Non-zero metrics in the last 30s|{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":8440,"time":{"ms":2539}},"total":{"ticks":176550,"time":{"ms":88938},"value":176550},"user":{"ticks":168110,"time":{"ms":86399}}},"handles":{"limit":{"hard":30000,"soft":30000},"open":258},"info":{"ephemeral_id":"05ec7c14-8e94-4c8e-986a-98e48995a95b","uptime":{"ms":60152}},"memstats":{"gc_next":1562228448,"memory_alloc":1012408016,"memory_total":87924400880,"rss":3661824},"runtime":{"goroutines":1703}},"filebeat":{"harvester":{"open_files":206,"running":206}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"active":18499,"batches":47,"failed":983040,"total":1001539}},"outputs":{"kafka":{"bytes_read":1487515,"bytes_write":259234457}},"pipeline":{"clients":1,"events":{"active":65557,"retry":1001539}}},"registrar":{"states":{"current":3}},"system":{"load":{"1":6.14,"15":8.33,"5":9.77,"norm":{"1":2.0467,"15":2.7767,"5":3.2567}}}}}}

2019-07-25T00:47:14.161Z|INFO|[monitoring]|log/log.go:145|Non-zero metrics in the last 30s|{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10670,"time":{"ms":2226}},"total":{"ticks":266170,"time":{"ms":89618},"value":266170},"user":{"ticks":255500,"time":{"ms":87392}}},"handles":{"limit":{"hard":30000,"soft":30000},"open":258},"info":{"ephemeral_id":"05ec7c14-8e94-4c8e-986a-98e48995a95b","uptime":{"ms":90153}},"memstats":{"gc_next":1841467904,"memory_alloc":1551755280,"memory_total":134154729784,"rss":11530240},"runtime":{"goroutines":1703}},"filebeat":{"harvester":{"open_files":206,"running":206}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":46,"failed":1030077,"total":1030077}},"outputs":{"kafka":{"bytes_read":1564021,"bytes_write":258310212}},"pipeline":{"clients":1,"events":{"active":65557,"retry":1030077}}},"registrar":{"states":{"current":3}},"system":{"load":{"1":5.7,"15":8.22,"5":9.32,"norm":{"1":1.9,"15":2.74,"5":3.1067}}}}}}|

BTW, I tried a few different sets of memory queue size, bulk_max_size, and flush.min_events; it is always like the above.
If I understand it correctly, pipeline.events.published is the filebeat throughput to Kafka.
In all parameter setups I tried, this metric always follows the same pattern: its value for the first 30 seconds is ~ the queue size (e.g., pipeline.events.published = 65536 when the queue size = 65536, and pipeline.events.published = 8212 when the queue size = 8192). After the first 30 seconds, pipeline.events.published is always 0, i.e., the field is missing from the [monitoring] log entries.
So it seems like after the first batches of publishing, filebeat just gets stuck? I also tried to increase output.kafka.worker from 1 to 3; the results follow the same pattern.

Are the following logs OK, or do they indicate trouble? I had lots of them in the log:
2019-07-25T19:50:10.863Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
2019-07-25T19:50:10.863Z INFO [publisher] pipeline/retry.go:168 done
2019-07-25T19:50:10.885Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
2019-07-25T19:50:10.885Z INFO [publisher] pipeline/retry.go:191 done

I am trying to understand the following in the logs below:

  1. Why is filebeat.events.done only 8?
  2. Why are there so many libbeat.output.events.failed?
  3. Why are there so many pipeline.events.retry?
    2019-07-25T19:18:04.794Z "filebeat":{"events":{"active":65941,"added":65949,"done":8}, "libbeat":{"output":{"events"{"active":40960,"batches":103,"failed":802816,"total":843776},"type":"kafka"},"outputs":{"kafka":{"bytes_
    read":3017495,"bytes_write":224973369}},"pipeline":{"clients":1,"events":{"active":65557,"filtered":206,"published":65556,"retry":786432,"total":65763}}}

After the first 30s, the logs always look like the following. Why are there so many pipeline.events.retry and output.events.failed? It seems every event sent to Kafka fails and is retried.
2019-07-25T19:18:35.061Z "output":{"events":{"active":8192,"batches":119,"failed":966656,"total":974848}},"outputs":{"kafka":{"bytes_read":1569256,"bytes_write":261213883}},"pipeline":{"clients":1,"events":
{"active":65557,
"retry":974848}}}

BTW, this stuck behavior seems to happen only in my perf benchmarking setup, not in real production. The only difference I can tell is: in real production, it tails ongoing logs; in my perf benchmark setup, it tails big log files from scratch in every run, since I use "rm -r data/*" to delete the registry files.

btw, "setting output.kafka.required_acks: 0" seems unblocked filebeat, i.e., I started to see meaningful (non-zero) libbeat.pipeline.events.published values again after the first 30s.
hmm, does this mean that filebeat receive close to 0(if any) ACK from kafka previously?? since in the first 30s, libbeat.pipeline.events.published is ~ queue size, then after the first 30s, libbeat.pipeline.events.published is always 0.

Hmm, after turning on debug logging, I found out it is due to authorization issues. This seems like it should be logged at error level.

Can you share the log messages that helped you identify the issue? Unfortunately the library we use doesn't really differentiate between log levels, and the best chance we have is to filter on the log message in order to decide on a log level.

The fail and retry counts are much too high. Normally they should be at 0 and only sporadically go up by a little, in case there is a hiccup in your network. With such high numbers of failures, one needs to have a look at the beats logs (best at debug level) and the kafka server logs.

In a kafka cluster each partition in a topic has a set number of replicas. I think the default number of replicas was even 0. Better double check your topics here. Replicas are required to keep your system running in case a broker goes down. A replica can either be fully in sync, or not updated yet. If your broker goes down and there is no in-sync replica, then the partition is bound to fail (you can't publish or consume). If, for example, a broker runs out of disk space (or some other error occurs), then the event cannot be stored. If this happens to the leader and there is no replica partition available, then you might not be able to publish any new events.
This is where required_acks comes into play. The default value of required_acks is 1, meaning filebeat waits for kafka to confirm that the event has been stored by the leader broker only (we don't wait for replicas). Setting it to a higher value decreases throughput but increases the reliability of your system, because you know there is at least one replica available in your cluster once an event has been ACKed by Kafka. The reason required_acks: 0 unblocks filebeat is: 0 means we don't need an ACK. We don't care about any errors and just continue sending. Filebeat just assumes (without knowing for sure) that all was fine. Meanwhile kafka might be in an unstable state and happily drop all your events.
The documentation on required_acks also mentions this briefly. A value of 1 is not necessarily safe either, but with 0 you're stating that you are completely fine with data loss for whatever reason (not recommended).
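A hedged sketch of the relevant output settings (the broker list and topic are placeholders):

    output.kafka:
      hosts: ["broker-1:9092"]   # placeholder broker list
      topic: "logs"              # placeholder topic
      # 1 (default): wait for the leader's ACK; -1: wait for all in-sync replicas; 0: no ACKs (risk of silent data loss)
      required_acks: 1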


@steffens thanks for the explanation!
Here is the log message on why the Kafka publish failed. "The client is not authorized to access this topic." seems very useful and should be surfaced at the default (INFO) logging level.
2019-08-01T20:33:24Z DBG [kafka] finished kafka batch
2019-08-01T20:33:24Z DBG [kafka] Kafka publish failed with: kafka server: The client is not authorized to access this topic.

BTW, I agree that required_acks: 0 is not safe; I just temporarily set it to 0 to debug what happened. Since we fixed the ACL issues, we have set it back to 1.